Fusion Transcript Detection from RNA-Seq using Jaccard Distance
Hamidreza Mohebbi, Nurit Haspel, D. Simovici, Joyce Quach
Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-09-21.
DOI: 10.1145/3388440.3415585 (https://doi.org/10.1145/3388440.3415585)
Citation count: 1
Abstract
Gene fusion events are quite common in prostate, lymphoid, soft tissue, breast, gastric, and lung cancers, which calls for fast and accurate fusion detection methods. However, accurate identification requires whole genome sequencing. Current state-of-the-art methods suffer from inefficiency, insufficient accuracy, and high false positive rates. In this research we present a parallel method that converts an inefficient categorical space into a compact binary array, thereby reducing the dimensionality of the data and speeding up the computation. The FDJD pipeline contains three steps: general alignment, fusion candidate generation, and refinement. Jaccard distance is used as the measure for finding the nearest neighbors of a given query binary fingerprint, alongside a fast KNN implementation. We benchmarked fusion prediction accuracy on both simulated and genuine RNA-Seq data sets and compared the results with the state-of-the-art methods STAR-Fusion, InFusion, and TopHat-Fusion. The genuine paired-end Illumina RNA-Seq data were obtained from 60 publicly available cancer cell line data sets. FDJD outperformed these popular alternative fusion detection methods on both simulated and genuine data sets, attaining 90% accuracy on simulated fusion transcript inputs. Of the 86 fusions predicted by at least three methods, 44 were found to be experimentally validated using a wisdom-of-crowds approach. FDJD is not the fastest of the studied methods, but it achieved the highest accuracy.
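To make the nearest-neighbor step concrete, here is a minimal sketch of a Jaccard-distance KNN search over binary fingerprints. It is not the FDJD implementation. The abstract does not say how fingerprints are encoded, so the sketch assumes each fingerprint is a Python integer used as a bit set. The names `popcount`, `jaccard_distance`, and `knn_jaccard` are hypothetical, and the brute-force scan stands in for the fast KNN implementation the authors mention.

```python
import heapq


def popcount(x: int) -> int:
    """Count the set bits in a non-negative integer."""
    return bin(x).count("1")


def jaccard_distance(a: int, b: int) -> float:
    """Jaccard distance between two bit sets: 1 - |a AND b| / |a OR b|."""
    union = popcount(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - popcount(a & b) / union


def knn_jaccard(query: int, fingerprints: list[int], k: int) -> list[tuple[float, int]]:
    """Return the k nearest fingerprints as (distance, index) pairs, closest first.

    Brute-force scan; ties in distance are broken by the lower index.
    """
    scored = ((jaccard_distance(query, fp), i) for i, fp in enumerate(fingerprints))
    return heapq.nsmallest(k, scored)


# Toy fingerprints (illustrative values only, not taken from the paper).
fingerprints = [0b101100, 0b101000, 0b010011, 0b111111]
query = 0b101100
neighbors = knn_jaccard(query, fingerprints, k=2)
```

Here `neighbors` holds the identical fingerprint at index 0 (distance 0) followed by the fingerprint at index 1, which shares two of the three bits in the union (distance 1/3). Packing the categorical data into bits means each comparison reduces to a bitwise AND, a bitwise OR, and two bit counts, which is one plausible reason a compact binary representation speeds up the search.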