Background: Allium vegetables (garlic and onion) are common flavorings in people's daily diets. Observational studies suggest that intake of allium vegetables may be correlated with a lower incidence of digestive system cancers. However, whether this relationship is causal remains controversial because of confounding factors and reverse causation. Therefore, we explored the causal relationship between intake of allium vegetables and digestive system cancers using a Mendelian randomization approach. Methods: First, we performed Mendelian randomization analyses using inverse variance weighting (IVW), weighted median, and MR-Egger approaches, and demonstrated the reliability of the results in sensitivity analyses. Second, multivariable Mendelian randomization was applied to adjust for smoking and alcohol consumption. Third, we explored the molecular mechanisms behind the positive results through network pharmacology and molecular docking methods. Results: The study suggests that increased garlic intake reduces gastric cancer risk. However, onion intake was not statistically associated with digestive system cancer. Conclusion: Garlic may have a protective effect against gastric cancer.
{"title":"Allium Vegetables Intake and Digestive System Cancer Risk: A Study Based on Mendelian Randomization, Network Pharmacology and Molecular Docking","authors":"Shuhao Li, Jingwen Lou, Yelina Mulatihan, Yuhang Xiong, Yao Li, Qi Xu","doi":"arxiv-2409.11187","DOIUrl":"https://doi.org/arxiv-2409.11187","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-16","platform":"Semanticscholar","paperid":"142255500"}
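As an illustration of the inverse variance weighting step named in the abstract above, the following is a minimal sketch of the standard fixed-effect IVW estimator computed from per-variant summary statistics. The function name, argument names, and toy numbers are assumptions made for illustration; they are not taken from the study, which does not describe its software.

```python
from math import sqrt


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from per-variant summary statistics.

    Each variant contributes a Wald ratio (beta_outcome / beta_exposure),
    weighted by beta_exposure**2 / se_outcome**2.
    """
    variants = list(zip(beta_exposure, beta_outcome, se_outcome))
    # Weighted sum of ratios, written without dividing by beta_exposure.
    numerator = sum(bx * by / se ** 2 for bx, by, se in variants)
    # Sum of the inverse-variance weights.
    denominator = sum(bx ** 2 / se ** 2 for bx, by, se in variants)
    estimate = numerator / denominator
    std_error = sqrt(1.0 / denominator)
    return estimate, std_error


# Hypothetical toy data: two variants whose Wald ratios are both -0.5.
estimate, std_error = ivw_estimate([0.1, 0.2], [-0.05, -0.10], [0.01, 0.02])
print(estimate, std_error)
```

A negative estimate corresponds to a protective direction of effect on the scale of the outcome statistics. The weighted median and MR-Egger estimators mentioned in the abstract relax the assumption that every variant is a valid instrument, and this sketch does not cover them.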
Wenjie Wei, Songtao Gui, Jian Yang, Erik Garrison, Jianbing Yan, Hai-Jun Liu
Summary: With the rapid development of long-read sequencing technologies, the era of individual complete genomes is approaching. We have developed wgatools, a cross-platform, ultrafast toolkit that supports a range of whole genome alignment (WGA) formats, offering practical tools for conversion, processing, statistical evaluation, and visualization of alignments, thereby facilitating population-level genome analysis and advancing functional and evolutionary genomics. Availability and Implementation: wgatools supports diverse formats and can process, filter, and statistically evaluate alignments, perform alignment-based variant calling, and visualize alignments both locally and genome-wide. Built with Rust for efficiency and safe memory usage, it ensures fast performance and can handle large datasets consisting of hundreds of genomes. wgatools is published as free software under the MIT open-source license, and its source code is freely available at https://github.com/wjwei-handsome/wgatools. Contact: weiwenjie@westlake.edu.cn (W.W.) or liuhaijun@yzwlab.cn (H.-J.L.).
{"title":"wgatools: an ultrafast toolkit for manipulating whole genome alignments","authors":"Wenjie Wei, Songtao Gui, Jian Yang, Erik Garrison, Jianbing Yan, Hai-Jun Liu","doi":"arxiv-2409.08569","DOIUrl":"https://doi.org/arxiv-2409.08569","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-13","platform":"Semanticscholar","paperid":"142255503"}
Alternative splicing is crucial in gene regulation, with significant implications in clinical settings and biotechnology. This review article compiles bioinformatics RNA-seq tools for investigating differential splicing, offering a detailed examination of their statistical methods, case applications, and benefits. A total of 22 tools are categorised by their statistical family (parametric, non-parametric, and probabilistic) and level of analysis (transcript, exon, and event). The central challenges in quantifying alternative splicing include correct splice site identification and accurate isoform deconvolution of transcripts. Benchmarking studies show no consensus on tool performance, revealing considerable variability across different scenarios. Tools with high citation frequency and continued developer maintenance, such as DEXSeq and rMATS, are recommended for prospective researchers. To aid in tool selection, a guide schematic is proposed based on variations in data input and the required level of analysis. Additionally, advancements in long-read RNA sequencing are expected to drive the evolution of differential splicing tools, reducing the need for isoform deconvolution and prompting further innovation.
{"title":"Selecting Differential Splicing Methods: Practical Considerations","authors":"Ben J Draper, Mark J Dunning, David C James","doi":"arxiv-2409.05458","DOIUrl":"https://doi.org/arxiv-2409.05458","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-09","platform":"Semanticscholar","paperid":"142180948"}
This paper provides a comprehensive survey of data structures for representing k-mer sets, which are fundamental in high-throughput sequencing analysis. It categorizes the methods into two main strategies: those using fingerprinting and hashing for compact storage, and those leveraging lexicographic properties for efficient representation. The paper reviews key operations supported by these structures, such as membership queries and dynamic updates, and highlights recent advancements in memory efficiency and query speed. A companion paper explores colored k-mer sets, which extend these concepts to integrate multiple datasets or genomes.
{"title":"Advancements in practical k-mer sets: essentials for the curious","authors":"Camille Marchet","doi":"arxiv-2409.05210","DOIUrl":"https://doi.org/arxiv-2409.05210","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-08","platform":"Semanticscholar","paperid":"142180950"}
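To make the operations named in the k-mer set survey above concrete (membership queries and dynamic updates), here is a minimal baseline sketch built on a plain Python hash set. The class name `KmerSet` and the choice to store canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) are assumptions for illustration. The compact fingerprinting and lexicographic structures the survey covers aim to use far less memory than this baseline.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


class KmerSet:
    """Uncompressed k-mer set supporting insertion and membership queries."""

    def __init__(self, k):
        self.k = k
        self._kmers = set()

    def add_sequence(self, seq):
        # Dynamic update: insert every k-mer of the sequence.
        for i in range(len(seq) - self.k + 1):
            self._kmers.add(canonical(seq[i:i + self.k]))

    def __contains__(self, kmer):
        # Membership query, strand-agnostic because of canonical form.
        return canonical(kmer) in self._kmers

    def __len__(self):
        return len(self._kmers)


kmers = KmerSet(3)
kmers.add_sequence("ACGTT")
print("GTT" in kmers, "TTT" in kmers, len(kmers))
```

In this toy input, `ACG` and `CGT` are reverse complements of each other, so they collapse into a single stored entry.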
Kuan Yan, Yue Zeng, Dai Shi, Ting Zhang, Dmytro Matsypura, Mark C. Gillies, Ling Zhu, Junbin Gao
Age-related macular degeneration (AMD) is a major cause of blindness in older adults, severely affecting vision and quality of life. Despite advances in understanding AMD, the molecular factors driving the severity of subretinal scarring (fibrosis) remain elusive, hampering the development of effective therapies. This study introduces a machine learning-based framework to predict key genes that are strongly correlated with lesion severity and to identify potential therapeutic targets to prevent subretinal fibrosis in AMD. Using an original RNA sequencing (RNA-seq) dataset from the diseased retinas of JR5558 mice, we developed a novel and specific feature engineering technique, including pathway-based dimensionality reduction and gene-based feature expansion, to enhance prediction accuracy. Two iterative experiments were conducted by leveraging Ridge and ElasticNet regression models to assess biological relevance and gene impact. The results highlight the biological significance of several key genes and demonstrate the framework's effectiveness in identifying novel therapeutic targets. The key findings provide valuable insights for advancing drug discovery efforts and improving treatment strategies for AMD, with the potential to enhance patient outcomes by targeting the underlying genetic mechanisms of subretinal lesion development.
{"title":"Machine Learning-Based Prediction of Key Genes Correlated to the Subretinal Lesion Severity in a Mouse Model of Age-Related Macular Degeneration","authors":"Kuan Yan, Yue Zeng, Dai Shi, Ting Zhang, Dmytro Matsypura, Mark C. Gillies, Ling Zhu, Junbin Gao","doi":"arxiv-2409.05047","DOIUrl":"https://doi.org/arxiv-2409.05047","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-08","platform":"Semanticscholar","paperid":"142180951"}
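The AMD abstract above names Ridge and ElasticNet regression without giving their objectives. As a hedged sketch, the function below evaluates a penalized least-squares loss in one common parameterization, in which `l1_ratio = 0` gives a ridge-style (pure L2) penalty and intermediate values give an elastic net. The function name, the scaling of the terms, and the toy data are assumptions, and scaling conventions differ between libraries. The study's actual models and hyperparameters are not stated in the abstract.

```python
def penalized_loss(X, y, weights, alpha, l1_ratio):
    """Squared-error loss plus a mixed L1/L2 penalty on the weights.

    loss = (1 / (2n)) * sum((y_i - x_i . w)^2)
           + alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2)
    """
    n = len(y)
    residuals = [
        yi - sum(xij * wj for xij, wj in zip(row, weights))
        for row, yi in zip(X, y)
    ]
    data_term = sum(r * r for r in residuals) / (2 * n)
    l1_norm = sum(abs(w) for w in weights)       # encourages sparse weights
    l2_norm_sq = sum(w * w for w in weights)     # shrinks all weights
    penalty = alpha * (l1_ratio * l1_norm + 0.5 * (1 - l1_ratio) * l2_norm_sq)
    return data_term + penalty


# Hypothetical toy example: two samples, two features (e.g. gene scores).
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 2.0]
print(penalized_loss(X, y, [1.0, 1.0], alpha=1.0, l1_ratio=0.5))
```

The L1 part tends to drive some weights to exactly zero, which is one reason elastic-net models are often used when ranking a small number of influential genes among many candidates.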
This paper provides a comprehensive review of recent advancements in k-mer-based data structures representing collections of several samples (sometimes called colored de Bruijn graphs) and their applications in large-scale sequence indexing and pangenomics. The review explores the evolution of k-mer set representations, highlighting the trade-offs between exact and inexact methods, as well as the integration of compression strategies and modular implementations. I discuss the impact of these structures on practical applications and describe recent uses of these methods for analysis. By surveying state-of-the-art techniques and identifying emerging trends, this work aims to guide researchers in selecting and developing methods for large-scale, reference-free genomic data. For a broader overview of k-mer set representations and foundational data structures, see the accompanying article on practical k-mer sets.
{"title":"Advancements in colored k-mer sets: essentials for the curious","authors":"Camille Marchet","doi":"arxiv-2409.05214","DOIUrl":"https://doi.org/arxiv-2409.05214","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-08","platform":"Semanticscholar","paperid":"142180949"}
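The idea of a colored k-mer set described in the review above can be sketched as a map from each k-mer to the set of samples ("colors") that contain it. The class and method names below are hypothetical, and the sketch is exact, uncompressed, and forward-strand only. The structures surveyed in the review add compression and, in some cases, inexact answers.

```python
from collections import defaultdict


class ColoredKmerIndex:
    """Map each k-mer to the set of sample names (colors) containing it."""

    def __init__(self, k):
        self.k = k
        self._colors = defaultdict(set)

    def _kmers(self, seq):
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]

    def add_sample(self, name, seq):
        # Color every k-mer of the sequence with the sample name.
        for kmer in self._kmers(seq):
            self._colors[kmer].add(name)

    def samples_sharing(self, query):
        """Return the samples that contain every k-mer of the query."""
        kmers = self._kmers(query)
        if not kmers:
            return set()
        shared = None
        for kmer in kmers:
            # .get avoids creating empty entries for unseen k-mers.
            found = self._colors.get(kmer, set())
            shared = set(found) if shared is None else shared & found
        return shared


index = ColoredKmerIndex(3)
index.add_sample("sample_A", "ACGTA")
index.add_sample("sample_B", "CGTAC")
print(index.samples_sharing("CGTA"))
print(index.samples_sharing("ACGT"))
```

Requiring every k-mer of the query to be present is one possible query rule. Practical tools often report samples that share at least a chosen fraction of the query's k-mers instead.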
Molecular sequence analysis is crucial for comprehending several biological processes, including protein-protein interactions, functional annotation, and disease classification. The large number of sequences and the inherently complicated nature of protein structures make it challenging to analyze such data. Finding patterns and enhancing subsequent research require the use of dimensionality reduction and feature selection approaches. Recently, a method called Correlated Clustering and Projection (CCP) has been proposed as an effective method for biological sequencing data. Although the CCP technique is effective for sequence visualization, it is still costly to compute. Furthermore, its utility for classifying molecular sequences is still uncertain. To solve these two problems, we present a Nearest Neighbor Correlated Clustering and Projection (CCP-NN)-based technique for efficiently preprocessing molecular sequence data. To group related molecular sequences and produce representative supersequences, CCP makes use of sequence-to-sequence correlations. As opposed to conventional methods, CCP doesn't rely on matrix diagonalization, so it can be applied to a range of machine-learning problems. We estimate the density map and compute the correlation using a nearest-neighbor search technique. We performed molecular sequence classification using CCP and CCP-NN representations to assess the efficacy of our proposed approach. Our findings show that CCP-NN considerably improves classification accuracy and significantly outperforms CCP in computational runtime.
{"title":"Nearest Neighbor CCP-Based Molecular Sequence Analysis","authors":"Sarwan Ali, Prakash Chourasia, Bipin Koirala, Murray Patterson","doi":"arxiv-2409.04922","DOIUrl":"https://doi.org/arxiv-2409.04922","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-07","platform":"Semanticscholar","paperid":"142180952"}
Hatem Ltaief, Rabab Alomairy, Qinglei Cao, Jie Ren, Lotfi Slim, Thorsten Kurth, Benedikt Dorschner, Salim Bougouffa, Rached Abdelkhalak, David E. Keyes
We exploit the widening margin in tensor-core performance between [FP64/FP32/FP16/INT8,FP64/FP32/FP16/FP8/INT8] on NVIDIA [Ampere,Hopper] GPUs to boost the performance of output accuracy-preserving mixed-precision computation of Genome-Wide Association Studies (GWAS) of 305K patients from the UK BioBank, the largest-ever GWAS cohort studied for genetic epistasis using a multivariate approach. Tile-centric adaptive-precision linear algebraic techniques motivated by reducing data motion gain enhanced significance with low-precision GPU arithmetic. At the core of Kernel Ridge Regression (KRR) techniques for GWAS lie compute-bound cubic-complexity matrix operations that inhibit scaling to aspirational dimensions of the population, genotypes, and phenotypes. We accelerate KRR matrix generation by redesigning the computation for Euclidean distances to engage INT8 tensor cores while exploiting symmetry. We accelerate the solution of the regularized KRR systems by deploying a new four-precision Cholesky-based solver, which, at 1.805 mixed-precision ExaOp/s on a nearly full Alps system, outperforms the state-of-the-art CPU-only REGENIE GWAS software by five orders of magnitude.
{"title":"Toward Capturing Genetic Epistasis From Multivariate Genome-Wide Association Studies Using Mixed-Precision Kernel Ridge Regression","authors":"Hatem Ltaief, Rabab Alomairy, Qinglei Cao, Jie Ren, Lotfi Slim, Thorsten Kurth, Benedikt Dorschner, Salim Bougouffa, Rached Abdelkhalak, David E. Keyes","doi":"arxiv-2409.01712","DOIUrl":"https://doi.org/arxiv-2409.01712","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-03","platform":"Semanticscholar","paperid":"142180954"}
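The kernel-matrix step described in the KRR abstract above rests on a well-known identity: the squared Euclidean distance between two rows equals the sum of their squared norms minus twice their dot product. With integer genotype codes, the dot products are pure integer arithmetic, which is the kind of operation low-precision integer hardware can execute. The CPU-only sketch below shows the identity and the use of symmetry. The function names, the 0/1/2 genotype coding, and the Gaussian kernel are assumptions for illustration, not the authors' GPU implementation.

```python
from math import exp


def squared_distance_matrix(genotypes):
    """Pairwise squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y.

    Only the upper triangle is computed; the lower triangle is mirrored.
    """
    n = len(genotypes)
    norms = [sum(g * g for g in row) for row in genotypes]
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Integer dot product of two genotype rows.
            dot = sum(a * b for a, b in zip(genotypes[i], genotypes[j]))
            dist[i][j] = dist[j][i] = norms[i] + norms[j] - 2 * dot
    return dist


def gaussian_kernel(dist, gamma):
    """Turn squared distances into a Gaussian kernel matrix (assumed kernel)."""
    return [[exp(-gamma * d) for d in row] for row in dist]


# Hypothetical genotypes: 3 individuals, 3 variants, coded 0/1/2.
genotypes = [[0, 1, 2], [2, 1, 0], [0, 1, 2]]
distances = squared_distance_matrix(genotypes)
kernel = gaussian_kernel(distances, gamma=0.5)
print(distances)
```

Computing only `j > i` roughly halves the work for the distance matrix, which is the symmetry saving the abstract refers to in general terms.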
Machine learning has shown great potential in the field of cancer multi-omics studies, offering incredible opportunities for advancing precision medicine. However, the challenges associated with dataset curation and task formulation pose significant hurdles, especially for researchers lacking a biomedical background. Here, we introduce CMOB, the first large-scale cancer multi-omics benchmark that integrates the TCGA platform, making data resources accessible and usable for machine learning researchers without significant preparation and expertise. To date, CMOB includes a collection of 20 cancer multi-omics datasets covering 32 cancers, accompanied by a systematic data processing pipeline. CMOB provides well-processed dataset versions to support 20 meaningful tasks in four studies, with a collection of benchmarks. We also integrate CMOB with two complementary resources and various biological tools to explore broader research avenues. All resources are openly accessible with user-friendly and compatible integration scripts that enable non-experts to easily incorporate this complementary information for various tasks. We conduct extensive experiments on selected datasets to offer recommendations on suitable machine learning baselines for specific applications. Through CMOB, we aim to facilitate algorithmic advances and hasten the development, validation, and clinical translation of machine-learning models for personalized cancer treatments. CMOB is available on GitHub (https://github.com/chenzRG/Cancer-Multi-Omics-Benchmark).
{"title":"CMOB: Large-Scale Cancer Multi-Omics Benchmark with Open Datasets, Tasks, and Baselines","authors":"Ziwei Yang, Rikuto Kotoge, Zheng Chen, Xihao Piao, Yasuko Matsubara, Yasushi Sakurai","doi":"arxiv-2409.02143","DOIUrl":"https://doi.org/arxiv-2409.02143","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-02","platform":"Semanticscholar","paperid":"142180953"}
Motivation: The Burrows-Wheeler Transform (BWT) is a common component in full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource intensive and hard to parallelize, and many methods for querying large full-text indices only report exact matches or their simple extensions. These limitations have hampered the biological applications of full-text indices. Results: We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 could index 100 assembled human genomes in 21 hours and index 7.3 terabases of commonly studied bacterial assemblies in 26 days. This was achieved using 82 gigabytes of memory at the peak without working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve all distinct local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale. Availability and implementation: https://github.com/lh3/ropebwt3
{"title":"BWT construction and search at the terabase scale","authors":"Heng Li","doi":"arxiv-2409.00613","DOIUrl":"https://doi.org/arxiv-2409.00613","journal":{"name":"arXiv - QuanBio - Genomics"},"publicationDate":"2024-09-01","platform":"Semanticscholar","paperid":"142180955"}
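For readers unfamiliar with the transform the ropebwt3 abstract builds on, the sketch below constructs a BWT by sorting all rotations of a text and counts exact pattern occurrences with backward search. This is the textbook method, shown only to illustrate the idea; it is not ropebwt3's algorithm, and it is far too slow and memory-hungry for genome-scale input. The function names are hypothetical, and the sketch assumes the text never contains the sentinel character `$`.

```python
def bwt(text):
    """Naive BWT: last column of the sorted rotations of text + '$'."""
    text += "$"  # sentinel, assumed absent from the text and sorting first
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact matches of pattern using backward search on the BWT."""
    # first[c] = index of the first row starting with character c.
    first = {}
    for index, char in enumerate(sorted(bwt_string)):
        first.setdefault(char, index)

    def rank(char, position):
        # Occurrences of char in bwt_string[:position]; linear scan for clarity.
        return bwt_string[:position].count(char)

    low, high = 0, len(bwt_string)  # half-open range of matching rows
    for char in reversed(pattern):
        if char not in first:
            return 0
        low = first[char] + rank(char, low)
        high = first[char] + rank(char, high)
        if low >= high:
            return 0
    return high - low


transformed = bwt("banana")
print(transformed)                            # annb$aa
print(count_occurrences(transformed, "ana"))  # 2
```

Production indices replace the linear `rank` scan with sampled or compressed rank structures, and they build the BWT without materializing every rotation.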