Related Inference: A Supervised Learning Approach to Detect Signal Variation in Genome Data
Mario Banuelos, Omar DeGuchy, Suzanne S. Sindi, Roummel F. Marcia
2020 28th European Signal Processing Conference (EUSIPCO), pp. 1215-1219
DOI: 10.23919/Eusipco47968.2020.9287597
Published: 2021-01-24
Abstract
The human genome is composed of nucleotides and is represented by a long sequence of the letters A, C, G, and T. Typically, organisms of the same species have similar genomes that differ in only a few sequences of varying lengths at varying positions. These differences appear as regions where letters are inserted, deleted, or inverted. Such anomalies are known as structural variants (SVs) and are difficult to detect. The standard approach for identifying SVs involves comparing fragments of DNA from the genome of interest to a reference genome. This process is complicated by errors in both sequencing and mapping, which can increase the number of false positive detections. In this work, we propose two approaches for reducing the number of false positives. We focus on refining deletions detected by the popular SV tool delly. In particular, we use a neural network and gradient boosting as post-processing steps that simultaneously consider sequencing data from a parent and a child. We compare the performance of each method on simulated and real parent-child data and show that including related individuals in the training data greatly improves the ability to detect true SVs.
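To make the idea of jointly considering a parent and a child more concrete, the following is a minimal sketch of how candidate deletions in a child might be annotated with supporting evidence from a parent before a classifier is applied. It is not the authors' feature set, which the abstract does not describe. The interval representation, the function names `reciprocal_overlap` and `parent_aware_features`, and the choice of reciprocal overlap as the parent signal are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: a deletion call is a half-open interval [start, end) on one chromosome.
Interval = Tuple[int, int]


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the overlap length divided by the longer interval's length (0.0 to 1.0)."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[1] - a[0], b[1] - b[0])


def parent_aware_features(
    child_calls: List[Interval], parent_calls: List[Interval]
) -> List[Tuple[int, float]]:
    """For each child call, return (deletion length, best reciprocal overlap with a parent call).

    Hypothetical feature vector: a downstream classifier (for example a neural
    network or gradient boosting model) could use the second value as evidence
    that the deletion was inherited rather than a false positive.
    """
    features = []
    for call in child_calls:
        best = max((reciprocal_overlap(call, p) for p in parent_calls), default=0.0)
        features.append((call[1] - call[0], best))
    return features


# Toy coordinates, invented for illustration only.
child = [(100, 200), (500, 650), (900, 950)]
parent = [(110, 210), (2000, 2100)]
features = parent_aware_features(child, parent)
```

In this toy input only the first child call has a matching parent call, so it is the only one with a nonzero parent-support value. A real pipeline would read the calls from delly's output and would likely add read-depth or mapping-quality features as well.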