
arXiv - QuanBio - Genomics: Latest Papers

Allium Vegetables Intake and Digestive System Cancer Risk: A Study Based on Mendelian Randomization, Network Pharmacology and Molecular Docking
Pub Date : 2024-09-16 DOI: arxiv-2409.11187
Shuhao Li, Jingwen Lou, Yelina Mulatihan, Yuhang Xiong, Yao Li, Qi Xu
Background: Allium vegetables (garlic and onion) are one of the flavorings in people's daily diets. Observational studies suggest that intake of allium vegetables may be correlated with a lower incidence of digestive system cancers. However, the existence of a causal relationship is still controversial due to confounding factors and reverse causation. Therefore, we explored the causal relationship between intake of allium vegetables and digestive system cancers using Mendelian randomization approach. Methods: First, we performed Mendelian randomization analyses using inverse variance weighting (IVW), weighted median, and MR-Egger approaches, and demonstrated the reliability of the results in the sensitivity step. Second, Multivariable Mendelian randomization was applied to adjust for smoking and alcohol consumption. Third, we explored the molecular mechanisms behind the positive results through network pharmacology and molecular docking methods. Results: The study suggests that increased intake of garlic reduced gastric cancer risk. However, onion intake was not statistically associated with digestive system cancer. Conclusion: Garlic may have a protective effect against gastric cancer.
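As a rough illustration of the inverse variance weighting (IVW) step named in the Methods, the sketch below computes the standard fixed-effect IVW estimate from per-variant summary statistics. It is not the authors' code; the function name `ivw_estimate`, the argument names, and the toy numbers are all assumptions made for illustration, and a real analysis would use dedicated Mendelian randomization software together with the sensitivity checks the abstract mentions.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted (IVW) causal estimate.

    Each genetic variant j contributes a Wald ratio
    beta_outcome[j] / beta_exposure[j]; IVW averages these ratios with
    weights beta_exposure[j]**2 / se_outcome[j]**2.
    All names here are hypothetical, chosen for this sketch.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    if not beta_exposure:
        raise ValueError("at least one variant is required")

    # Weighted sum of (exposure effect * outcome effect) over outcome variance.
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    # Sum of the IVW weights.
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    return estimate, std_error


# Toy summary statistics (made up): every outcome effect is exactly
# half of the matching exposure effect.
toy_bx = [0.1, 0.2, 0.4]
toy_by = [0.05, 0.1, 0.2]
toy_se = [0.01, 0.02, 0.02]
toy_estimate, toy_std_error = ivw_estimate(toy_bx, toy_by, toy_se)
```

Because the toy outcome effects are constructed as half of the exposure effects, the estimate recovers a ratio of one half. This only demonstrates the arithmetic and says nothing about the study's actual effect sizes.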
wgatools: an ultrafast toolkit for manipulating whole genome alignments
Pub Date : 2024-09-13 DOI: arxiv-2409.08569
Wenjie Wei, Songtao Gui, Jian Yang, Erik Garrison, Jianbing Yan, Hai-Jun Liu
Summary: With the rapid development of long-read sequencing technologies, the era of individual complete genomes is approaching. We have developed wgatools, a cross-platform, ultrafast toolkit that supports a range of whole genome alignment (WGA) formats, offering practical tools for conversion, processing, statistical evaluation, and visualization of alignments, thereby facilitating population-level genome analysis and advancing functional and evolutionary genomics. Availability and Implementation: wgatools supports diverse formats and can process, filter, and statistically evaluate alignments, perform alignment-based variant calling, and visualize alignments both locally and genome-wide. Built with Rust for efficiency and safe memory usage, it ensures fast performance and can handle large datasets consisting of hundreds of genomes. wgatools is published as free software under the MIT open-source license, and its source code is freely available at https://github.com/wjwei-handsome/wgatools. Contact: weiwenjie@westlake.edu.cn (W.W.) or liuhaijun@yzwlab.cn (H.-J.L.).
Selecting Differential Splicing Methods: Practical Considerations
Pub Date : 2024-09-09 DOI: arxiv-2409.05458
Ben J Draper, Mark J Dunning, David C James
Alternative splicing is crucial in gene regulation, with significant implications in clinical settings and biotechnology. This review article compiles bioinformatics RNA-seq tools for investigating differential splicing; offering a detailed examination of their statistical methods, case applications, and benefits. A total of 22 tools are categorised by their statistical family (parametric, non-parametric, and probabilistic) and level of analysis (transcript, exon, and event). The central challenges in quantifying alternative splicing include correct splice site identification and accurate isoform deconvolution of transcripts. Benchmarking studies show no consensus on tool performance, revealing considerable variability across different scenarios. Tools with high citation frequency and continued developer maintenance, such as DEXSeq and rMATS, are recommended for prospective researchers. To aid in tool selection, a guide schematic is proposed based on variations in data input and the required level of analysis. Additionally, advancements in long-read RNA sequencing are expected to drive the evolution of differential splicing tools, reducing the need for isoform deconvolution and prompting further innovation.
Advancements in practical k-mer sets: essentials for the curious
Pub Date : 2024-09-08 DOI: arxiv-2409.05210
Camille Marchet
This paper provides a comprehensive survey of data structures for representing k-mer sets, which are fundamental in high-throughput sequencing analysis. It categorizes the methods into two main strategies: those using fingerprinting and hashing for compact storage, and those leveraging lexicographic properties for efficient representation. The paper reviews key operations supported by these structures, such as membership queries and dynamic updates, and highlights recent advancements in memory efficiency and query speed. A companion paper explores colored k-mer sets, which extend these concepts to integrate multiple datasets or genomes.
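To make the two operations named above concrete (membership queries and dynamic updates), here is a deliberately naive baseline: an exact k-mer set backed by Python's built-in hash set. It is only a sketch and not one of the compact structures the survey covers. The function names are hypothetical, and storing canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) is an assumption added here, not something stated in the abstract.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def build_kmer_set(sequences, k):
    """Collect the canonical k-mers of all sequences into a hash set."""
    kmers = set()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            kmers.add(canonical(seq[start:start + k]))
    return kmers


def contains(kmers, kmer):
    """Membership query: is this k-mer (on either strand) in the set?"""
    return canonical(kmer) in kmers


def insert(kmers, kmer):
    """Dynamic update: add one k-mer to an existing set."""
    kmers.add(canonical(kmer))


# Toy usage on a made-up sequence.
toy_set = build_kmer_set(["ACGTAC"], k=3)
found_cgt = contains(toy_set, "CGT")      # reverse complement of ACG
found_aaa_before = contains(toy_set, "AAA")
insert(toy_set, "TTT")                    # stored as its canonical form AAA
found_aaa_after = contains(toy_set, "AAA")
```

A plain hash set stores every k-mer as a full string, which is exactly the memory cost that fingerprinting and lexicographic representations aim to reduce; the sketch is useful mainly as a correctness reference.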
Machine Learning-Based Prediction of Key Genes Correlated to the Subretinal Lesion Severity in a Mouse Model of Age-Related Macular Degeneration
Pub Date : 2024-09-08 DOI: arxiv-2409.05047
Kuan Yan, Yue Zeng, Dai Shi, Ting Zhang, Dmytro Matsypura, Mark C. Gillies, Ling Zhu, Junbin Gao
Age-related macular degeneration (AMD) is a major cause of blindness in older adults, severely affecting vision and quality of life. Despite advances in understanding AMD, the molecular factors driving the severity of subretinal scarring (fibrosis) remain elusive, hampering the development of effective therapies. This study introduces a machine learning-based framework to predict key genes that are strongly correlated with lesion severity and to identify potential therapeutic targets to prevent subretinal fibrosis in AMD. Using an original RNA sequencing (RNA-seq) dataset from the diseased retinas of JR5558 mice, we developed a novel and specific feature engineering technique, including pathway-based dimensionality reduction and gene-based feature expansion, to enhance prediction accuracy. Two iterative experiments were conducted by leveraging Ridge and ElasticNet regression models to assess biological relevance and gene impact. The results highlight the biological significance of several key genes and demonstrate the framework's effectiveness in identifying novel therapeutic targets. The key findings provide valuable insights for advancing drug discovery efforts and improving treatment strategies for AMD, with the potential to enhance patient outcomes by targeting the underlying genetic mechanisms of subretinal lesion development.
Advancements in colored k-mer sets: essentials for the curious
Pub Date : 2024-09-08 DOI: arxiv-2409.05214
Camille Marchet
This paper provides a comprehensive review of recent advancements in k-mer-based data structures representing collections of several samples (sometimes called colored de Bruijn graphs) and their applications in large-scale sequence indexing and pangenomics. The review explores the evolution of k-mer set representations, highlighting the trade-offs between exact and inexact methods, as well as the integration of compression strategies and modular implementations. I discuss the impact of these structures on practical applications and describe recent utilization of these methods for analysis. By surveying the state-of-the-art techniques and identifying emerging trends, this work aims to guide researchers in selecting and developing methods for large scale and reference-free genomic data. For a broader overview of k-mer set representations and foundational data structures, see the accompanying article on practical k-mer sets.
Nearest Neighbor CCP-Based Molecular Sequence Analysis
Pub Date : 2024-09-07 DOI: arxiv-2409.04922
Sarwan Ali, Prakash Chourasia, Bipin Koirala, Murray Patterson
Molecular sequence analysis is crucial for comprehending several biological processes, including protein-protein interactions, functional annotation, and disease classification. The large number of sequences and the inherently complicated nature of protein structures make it challenging to analyze such data. Finding patterns and enhancing subsequent research requires the use of dimensionality reduction and feature selection approaches. Recently, a method called Correlated Clustering and Projection (CCP) has been proposed as an effective method for biological sequencing data. The CCP technique is still costly to compute even though it is effective for sequence visualization. Furthermore, its utility for classifying molecular sequences is still uncertain. To solve these two problems, we present a Nearest Neighbor Correlated Clustering and Projection (CCP-NN)-based technique for efficiently preprocessing molecular sequence data. To group related molecular sequences and produce representative supersequences, CCP makes use of sequence-to-sequence correlations. As opposed to conventional methods, CCP doesn't rely on matrix diagonalization, therefore it can be applied to a range of machine-learning problems. We estimate the density map and compute the correlation using a nearest-neighbor search technique. We performed molecular sequence classification using CCP and CCP-NN representations to assess the efficacy of our proposed approach. Our findings show that CCP-NN considerably improves classification task accuracy as well as significantly outperforms CCP in terms of computational runtime.
Toward Capturing Genetic Epistasis From Multivariate Genome-Wide Association Studies Using Mixed-Precision Kernel Ridge Regression
Pub Date : 2024-09-03 DOI: arxiv-2409.01712
Hatem Ltaief, Rabab Alomairy, Qinglei Cao, Jie Ren, Lotfi Slim, Thorsten Kurth, Benedikt Dorschner, Salim Bougouffa, Rached Abdelkhalak, David E. Keyes
We exploit the widening margin in tensor-core performance between [FP64/FP32/FP16/INT8, FP64/FP32/FP16/FP8/INT8] on NVIDIA [Ampere, Hopper] GPUs to boost the performance of output accuracy-preserving mixed-precision computation of Genome-Wide Association Studies (GWAS) of 305K patients from the UK BioBank, the largest-ever GWAS cohort studied for genetic epistasis using a multivariate approach. Tile-centric adaptive-precision linear algebraic techniques motivated by reducing data motion gain enhanced significance with low-precision GPU arithmetic. At the core of Kernel Ridge Regression (KRR) techniques for GWAS lie compute-bound cubic-complexity matrix operations that inhibit scaling to aspirational dimensions of the population, genotypes, and phenotypes. We accelerate KRR matrix generation by redesigning the computation for Euclidean distances to engage INT8 tensor cores while exploiting symmetry. We accelerate solution of the regularized KRR systems by deploying a new four-precision Cholesky-based solver, which, at 1.805 mixed-precision ExaOp/s on a nearly full Alps system, outperforms the state-of-the-art CPU-only REGENIE GWAS software by five orders of magnitude.
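The abstract does not spell out how the Euclidean distance computation was redesigned. One well-known identity that turns pairwise squared distances into integer dot products, which is the kind of work matrix-multiply hardware handles well, is ‖x − y‖² = ‖x‖² + ‖y‖² − 2 x·y. The pure-Python sketch below illustrates that identity and the use of symmetry (only the upper triangle is computed). Treating genotypes as small integers such as 0, 1, 2 and the function name `squared_distance_matrix` are assumptions for illustration; this is not the authors' GPU implementation.

```python
def squared_distance_matrix(genotypes):
    """Pairwise squared Euclidean distances between integer genotype rows.

    Uses ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * (x . y), so the dominant
    cost is the dot products. Only pairs with i < j are computed and the
    result is mirrored, since the matrix is symmetric.
    """
    n = len(genotypes)
    # Squared norm of every row, computed once.
    norms = [sum(value * value for value in row) for row in genotypes]
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(genotypes[i], genotypes[j]))
            d = norms[i] + norms[j] - 2 * dot
            dist[i][j] = d
            dist[j][i] = d  # mirror: distance is symmetric
    return dist


# Toy genotype matrix (made up): 3 individuals, 3 variants coded 0/1/2.
toy_genotypes = [
    [0, 1, 2],
    [2, 1, 0],
    [1, 1, 1],
]
toy_distances = squared_distance_matrix(toy_genotypes)
```

With integer inputs every intermediate value stays an exact integer, which is one reason low-precision integer arithmetic can be attractive for this step. A Gaussian kernel entry would then follow as exp(−γ·d) for some bandwidth γ, which is likewise an assumption here rather than a detail taken from the abstract.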
CMOB: Large-Scale Cancer Multi-Omics Benchmark with Open Datasets, Tasks, and Baselines
Pub Date : 2024-09-02 DOI: arxiv-2409.02143
Ziwei Yang, Rikuto Kotoge, Zheng Chen, Xihao Piao, Yasuko Matsubara, Yasushi Sakurai
Machine learning has shown great potential in the field of cancer multi-omics studies, offering incredible opportunities for advancing precision medicine. However, the challenges associated with dataset curation and task formulation pose significant hurdles, especially for researchers lacking a biomedical background. Here, we introduce the CMOB, the first large-scale cancer multi-omics benchmark that integrates the TCGA platform, making data resources accessible and usable for machine learning researchers without significant preparation and expertise. To date, CMOB includes a collection of 20 cancer multi-omics datasets covering 32 cancers, accompanied by a systematic data processing pipeline. CMOB provides well-processed dataset versions to support 20 meaningful tasks in four studies, with a collection of benchmarks. We also integrate CMOB with two complementary resources and various biological tools to explore broader research avenues. All resources are open-accessible with user-friendly and compatible integration scripts that enable non-experts to easily incorporate this complementary information for various tasks. We conduct extensive experiments on selected datasets to offer recommendations on suitable machine learning baselines for specific applications. Through CMOB, we aim to facilitate algorithmic advances and hasten the development, validation, and clinical translation of machine-learning models for personalized cancer treatments. CMOB is available on GitHub (https://github.com/chenzRG/Cancer-Multi-Omics-Benchmark).
Citations: 0
BWT construction and search at the terabase scale
Pub Date : 2024-09-01 DOI: arxiv-2409.00613
Heng Li
Motivation: The Burrows-Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions. These limitations have hampered the biological applications of full-text indices. Results: We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 could index 100 assembled human genomes in 21 hours and index 7.3 terabases of commonly studied bacterial assemblies in 26 days. This was achieved using 82 gigabytes of memory at the peak without working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve all distinct local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale. Availability and implementation: https://github.com/lh3/ropebwt3
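For readers unfamiliar with how a BWT supports exact-match queries, the following is a minimal Python sketch of the textbook idea. It is not ropebwt3's algorithm: the construction sorts all rotations naively, and the function names `bwt` and `count_occurrences` are illustrative assumptions, not part of any real tool. It only shows why counting pattern occurrences needs nothing beyond the transformed string.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Naive BWT: sort all rotations of text + '$' and keep the last column.

    Quadratic in memory, so suitable for toy inputs only; terabase-scale
    tools use entirely different construction strategies.
    """
    assert "$" not in text, "'$' is reserved as the end-of-text sentinel"
    s = text + "$"  # '$' sorts before letters in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str: str, pattern: str) -> int:
    """Count exact matches of pattern via backward search on the BWT."""
    counts = Counter(bwt_str)
    # first_pos[c] = number of characters in the text strictly smaller than c
    first_pos, total = {}, 0
    for c in sorted(counts):
        first_pos[c] = total
        total += counts[c]

    def rank(c: str, i: int) -> int:
        # Occurrences of c in bwt_str[:i]; linear here, whereas real
        # indices answer this from precomputed, compressed tables.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # current interval of sorted rotations
    for c in reversed(pattern):
        if c not in first_pos:
            return 0
        lo = first_pos[c] + rank(c, lo)
        hi = first_pos[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    index = bwt("banana")
    print(index)
    print(count_occurrences(index, "ana"))
```

Each step of the backward search narrows an interval over the sorted rotations by one pattern character, read right to left. Extending this from exact counts to inexact alignment, as the abstract describes, requires additional machinery that this sketch does not attempt.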
Citations: 0