{"title":"ntsm: an alignment-free, ultra-low-coverage, sequencing technology agnostic, intraspecies sample comparison tool for sample swap detection.","authors":"Justin Chu, Jiazhen Rong, Xiaowen Feng, Heng Li","doi":"10.1093/gigascience/giae024","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g., mix of Oxford Nanopore Technologies, Pacific Bioscience, Illumina data, etc.) remains a common issue plaguing large-scale studies. At present, all sample swapping detection methods require costly and unnecessary (e.g., if data are only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarly. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.</p><p><strong>Findings: </strong>The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per sample error rate, and coverage bias (i.e., missing sites) can also be estimated with this information, which can be used to determine if a spatially indexed principal component analysis (PCA)-based prescreening method can be used, which can greatly speed up analysis by preventing exhaustive all-to-all comparisons.</p><p><strong>Conclusions: </strong>Because this tool processes raw data, is faster than alignment, and can be used on very low-coverage data, it can save an immense degree of computational resources in standard quality control (QC) pipelines. It is robust enough to be used on different sequencing data types, important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample swap detection, this method also provides information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8000,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11148594/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giae024","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g., mix of Oxford Nanopore Technologies, Pacific Bioscience, Illumina data, etc.) remains a common issue plaguing large-scale studies. At present, all sample swapping detection methods require costly and unnecessary (e.g., if data are only used for genome assembly) alignment, positional sorting, and indexing of the data in order to compare similarly. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important.
Findings: The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity using a likelihood ratio-based test. Per sample error rate, and coverage bias (i.e., missing sites) can also be estimated with this information, which can be used to determine if a spatially indexed principal component analysis (PCA)-based prescreening method can be used, which can greatly speed up analysis by preventing exhaustive all-to-all comparisons.
Conclusions: Because this tool processes raw data, is faster than alignment, and can be used on very low-coverage data, it can save an immense degree of computational resources in standard quality control (QC) pipelines. It is robust enough to be used on different sequencing data types, important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample swap detection, this method also provides information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.
期刊介绍:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.