Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822617
Jiajie Peng, Hansheng Xue, Y. Shao, Xuequn Shang, Yadong Wang, Jin Chen
Making the right disease diagnosis from complex clinical characteristics and heterogeneous genetic backgrounds is critical, yet it remains challenging. Recently, Human Phenotype Ontology (HPO)-based phenotype similarity has been widely used to aid disease diagnosis. However, existing measures are adapted from Gene Ontology-based term similarity models, which are not optimized for human phenotype ontologies. We propose a new similarity measure called PhenoSim. Our model includes a noise reduction component to model noisy patient phenotype data, and a path-constrained Information Content-based method for measuring phenotype semantic similarity. Evaluation tests showed that PhenoSim could improve the performance of HPO-based phenotype similarity measurement.
Title: Measuring phenotype semantic similarity using Human Phenotype Ontology
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
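To make the idea of Information Content-based term similarity concrete, the following is a minimal sketch of a classical Resnik-style measure on a toy ontology. It is not PhenoSim: the path constraint and the noise reduction component described in the abstract are not modelled, and the term names and annotation counts are hypothetical.

```python
import math

# Toy ontology: term -> set of parent terms (hypothetical, not real HPO IDs).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "B1": {"B"},
}

# Hypothetical number of diseases annotated directly to each term.
DIRECT_COUNTS = {"root": 0, "A": 1, "B": 2, "A1": 3, "A2": 2, "B1": 2}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), where p(t) counts annotations to t or its descendants."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(c for t, c in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(covered / total)


def resnik_similarity(t1, t2):
    """Similarity = IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


sibling_sim = resnik_similarity("A1", "A2")    # shares the specific ancestor "A"
unrelated_sim = resnik_similarity("A1", "B1")  # shares only the root
```

Terms that share a specific ancestor score higher than terms whose only common ancestor is the root, whose information content is zero.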
Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822667
L. Chan, S. Wong, W. H. Chiu
An Electronic Health Record (EHR) system aims not only to provide a digital, structured form of patient records but also to support clinical decisions, patient care and patient advice. The EHR database is still an under-explored big data resource that hosts a large number of cases with complete recovery, good prognosis, reliable diagnostic tests and effective treatments. A set of 112 abdominal computed tomography imaging examination reports, consisting of 59 cases of hepatocellular carcinoma (HCC) or liver metastases (the HCC group, for simplicity) and 53 cases with no abnormality detected (the NAD group), was collected from four hospitals in Hong Kong. We extracted terms related to liver cancer from the reports and mapped them to ontological features using Systematized Nomenclature of Medicine (SNOMED) Clinical Terms (CT). Each feature value was further weighted using a systematic PubMed search method. Association levels between every pair of features in the HCC and NAD groups were quantified using Pearson's correlation coefficient. The distribution of association levels in the HCC group was compared with that in the NAD group. The HCC group reveals a distinct association pattern that signifies liver cancer and provides clinical decision support for suspected cases.
Title: Ontological features of Electronic Health Records reveal distinct association patterns in liver cancer
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
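The pairwise association step described in the abstract can be sketched as follows. This is an illustrative implementation of Pearson's correlation coefficient over every pair of features. The feature names and values are hypothetical, and the PubMed-based weighting is not modelled.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_x * var_y)


def pairwise_associations(features):
    """Map each unordered feature pair to its correlation across reports."""
    names = sorted(features)
    return {
        (a, b): pearson(features[a], features[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical weighted feature values, one entry per report.
example_features = {
    "lesion": [1.0, 2.0, 3.0],
    "enhancement": [2.0, 4.0, 6.0],
    "normal_liver": [3.0, 2.0, 1.0],
}
associations = pairwise_associations(example_features)
```

Computing `associations` separately for each group of reports yields two distributions of correlation values that can then be compared.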
Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822646
Xiujuan Sun, Fa Zhang, Xiaohua Wan, Jinzhi Zhang
To spare users cumbersome command-line operations and repeated parameter adjustments, this paper presents automated paired-end whole genome re-sequencing (aWGRS) data processing, which relies on pre-installed dependencies to map reads to a reference and realign variations. aWGRS takes paired-end reads and a reference genome as input and returns re-sequencing information. The concept behind the tool is that re-sequencing requires several steps: alignment to the reference, single nucleotide polymorphism (SNP) calling, insertion/deletion (InDel) calling, structural variant (SV) calling, and annotation. By introducing and adjusting a new concept called the recall rate, the coverage rate and accuracy rate can be met at the same time. Within the range of the recall rate, a variation is evaluated by two criteria, its quality value and the number of reads that support it, and the one with the higher quality value and the larger number of supporting reads is finally picked. Genome-wide genetic variations between precocious trifoliate orange and its wild type are identified in [1], and empirical results show a large reduction in the number of variations and a great improvement in accuracy for aWGRS compared with the results of [1], which were provided by the Beijing Genomics Institute (BGI). Overall, the adjustable parameters adopted in aWGRS can affect the results of the experiment, and the default filtering strategy using the mutation recall rate can also attain good results automatically.
Title: aWGRS: Automates paired-end whole genome re-sequencing data analysis framework
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
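The two-criterion selection described in the abstract (quality value and number of supporting reads) might look roughly like the sketch below. The `Candidate` type, the threshold names and their default values are all hypothetical, and the abstract does not define the recall rate precisely, so it is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate variation at one reference position."""
    position: int
    allele: str
    quality: float
    supporting_reads: int


def pick_best(candidates, min_quality=20.0, min_support=3):
    """Keep, per position, the candidate with the highest quality and support.

    Candidates below either (assumed) threshold are discarded first; ties on
    quality are broken by the number of supporting reads.
    """
    best = {}
    for cand in candidates:
        if cand.quality < min_quality or cand.supporting_reads < min_support:
            continue
        current = best.get(cand.position)
        key = (cand.quality, cand.supporting_reads)
        if current is None or key > (current.quality, current.supporting_reads):
            best[cand.position] = cand
    return [best[pos] for pos in sorted(best)]


calls = [
    Candidate(100, "A>G", 35.0, 12),
    Candidate(100, "A>T", 35.0, 4),   # same quality, fewer supporting reads
    Candidate(250, "C>T", 10.0, 30),  # fails the quality threshold
    Candidate(400, "G>A", 50.0, 8),
]
selected = pick_best(calls)
```

Lowering `min_quality` or `min_support` admits more candidates, which illustrates why the adjustable parameters can change the final call set.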
Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822581
A. Athreya, Alan J. Gaglio, Z. Kalbarczyk, R. Iyer, J. Cairns, Krishna R. Kalari, R. Weinshilboum, Liewei Wang
This paper demonstrates an unsupervised learning approach to identify genes with significant differential expression across single-cell subpopulations induced by therapeutic treatment. Identifying this set of genes makes it possible to use well-established bioinformatics approaches such as pathway analysis to establish their biological relevance. A biologist can then use prior knowledge to investigate in the laboratory a few particular candidates among the subset of genes overlapping with relevant pathways. Due to the large size of the human genome and limitations in cost and skilled resources, biologists benefit from analytical methods combined with pathway analysis to design laboratory experiments focusing on only a few significant genes. As an example, we show how model-based unsupervised methods can identify a small set of genes (1% of the genome) that have significant differential expression in single cells and are also highly correlated to pathways (p-value < 1E-7) with anticancer effects driven by the antidiabetic drug metformin. Further analysis of genes on these relevant pathways reveals three candidate genes previously implicated in several anticancer mechanisms in other cancers, not driven by metformin. Identification of these genes can help biologists and clinicians design laboratory experiments to establish the molecular mechanisms of metformin in triple-negative breast cancer. In a domain where there is no prior knowledge of small biologically significant data, we demonstrate that careful data-driven methods can infer such significant small data to explain biological mechanisms.
Title: Unsupervised single-cell analysis in triple-negative breast cancer: A case study
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822756
Qianqian Huang, Xiaolong Zhang
In protein-protein interactions, only a small subset of hot spot residues contributes significantly to the binding free energy. Therefore, there is an imbalance between the number of hot spots and non-hot spots. Predicting hot spot residues is very important for understanding protein-protein interactions. This paper presents an improved ensemble learning method, AdaBoost with SMOTE, to deal with the imbalanced data and predict protein hot spots in the latest SKEMPI database. First, amino acid information, such as hydrophobicity, and protein structural features are extracted. Then the mRMR algorithm is used to select features. Finally, the data are processed with SMOTE to handle the class imbalance, and the protein hot spots are predicted by the AdaBoost ensemble learning method. Experimental results show that the proposed method can improve prediction accuracy.
Title: An improved ensemble learning method with SMOTE for protein interaction hot spots prediction
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
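The SMOTE oversampling step mentioned in the abstract can be illustrated with a small, dependency-free sketch. It follows the standard SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours). The function name, parameters and toy points are assumptions for illustration, and the AdaBoost classifier and mRMR feature selection are not shown.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        others = [p for i, p in enumerate(minority) if i != base_idx]
        # The k nearest minority neighbours of the chosen base sample.
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical hot spot feature vectors (the minority class).
hot_spots = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = smote(hot_spots, n_new=5)
```

Each synthetic point lies on a line segment between two real minority samples, so the minority class is enlarged without simply duplicating existing residues.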
Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822541
Yi Zhang, Xinan Liu, J. MacLeod, Jinze Liu
Alternative splicing (AS) is a regulated process that enables the production of multiple mRNA transcripts from a single multi-exon gene. The availability of large-scale RNA-seq datasets has made it possible to predict splice junctions, as well as splice sites through spliced alignment to the reference genome. This greatly enhances the capability to decipher gene structures and explore the diversity of splicing variants. However, existing ab initio aligners are vulnerable to false positive spliced alignments as a result of sequence errors and random sequence matches. These spurious alignments can lead to a significant set of false positive splice junction predictions, confusing downstream analyses of splice variant detection and abundance estimation. In this work, we illustrate that splice junction sequence characteristics can be ascertained from experimental data with deep learning techniques. We employ deep convolutional neural networks for a novel splice junction classification tool named DeepSplice that (i) outperforms state-of-the-art methods for predicting splice sites, (ii) shows high computational efficiency and (iii) can be applied to self-defined training data by users.
Title: DeepSplice: Deep classification of novel splice junctions revealed by RNA-seq
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822578
Qian Zhu, Anirudh Akkati, Pornpoh Hongwattanakul
About 382 million people had diabetes in 2013, and the International Diabetes Federation estimated that 4.9 million people died from diabetes in 2014. Diabetes continues to be a chronic disease plagued by frequent hospital readmissions. To better understand the risk features impacting readmissions for future prevention and management, in this study we programmatically analyzed a large clinical dataset containing more than 100,000 clinical records for diabetes patients from 130 US hospitals. Specifically, we developed three different machine learning algorithms, Logistic Regression, Random Forest and manipulated Random Forest, to identify and prioritize the most significant risk features. Comparing the results generated by these three methods, the manipulated Random Forest shows a greater capacity to generate a more complete and concrete list of readmission-related risk features. This method is generalizable and can be applied in other disease-oriented studies.
Title: Risk feature assessment of readmission for diabetes
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822571
Jianyu Shi, Ke Gao, Xuequn Shang, S. Yiu
There is an urgent need to discover or predict drug-drug interactions (DDIs), which can cause serious adverse drug reactions. However, preclinical detection of DDIs bears a high cost. Similarity-based computational approaches can assist experimental approaches. Utilizing pre-market drug similarities, they are able to predict DDIs on a large scale. However, they neglect the topological structure among DDIs and non-DDIs and suffer from slow training and high memory use. Alternatively, they suffer from the bias that pairs between a newly given drug and drugs having many DDIs tend to obtain high ranks. More importantly, they lack an effective combination of multiple predictions. To address these issues, we develop a local classification-based model (LCM), which has the advantages of faster training, a smaller memory requirement and freedom from that bias. We further design a novel supervised fusion algorithm based on the Dempster-Shafer (DS) theory of evidence for combining multiple predictions. Finally, the experiments demonstrate that our LCM-DS is significantly superior to three state-of-the-art approaches and outperforms both individual LCMs and classical fusion algorithms.
Title: LCM-DS: A novel approach of predicting drug-drug interactions for new drugs via Dempster-Shafer theory of evidence
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
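As background for the fusion step, here is a minimal sketch of the classical Dempster rule of combination over a two-hypothesis frame. It is the textbook rule, not the supervised fusion algorithm proposed in the paper, and the mass values assigned to the two hypothetical predictors are invented for illustration.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass assigned to contradictory hypotheses.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; combination undefined")
    # Renormalise by the non-conflicting mass.
    return {s: m / (1.0 - conflict) for s, m in combined.items()}


DDI = frozenset({"ddi"})
NON_DDI = frozenset({"non_ddi"})
UNKNOWN = DDI | NON_DDI  # mass left on the whole frame expresses ignorance

# Hypothetical evidence from two similarity-based predictors for one drug pair.
predictor_1 = {DDI: 0.6, UNKNOWN: 0.4}
predictor_2 = {DDI: 0.5, NON_DDI: 0.3, UNKNOWN: 0.2}
fused = combine(predictor_1, predictor_2)
```

Because both sources lean towards an interaction, the fused mass on `DDI` ends up higher than either source assigned on its own, while the conflicting mass is discarded and the rest renormalised.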
Tuning bioinformatics pipelines and training software parameters require sequencing data with known ground truth, which is difficult to obtain from real sequencing data. In particular, for applications that detect low-frequency variations (such as ctDNA sequencing), it is hard to tell whether a called variation is a true positive or a false positive caused by errors from sequencing or other processes. In these cases, simulated data with configured variations can be used to troubleshoot and validate bioinformatics programs. Although many next generation sequencing (NGS) simulators have already been developed, most of them lack the capability to simulate many practical features, such as target capture sequencing, copy number variations, gene fusions, amplification bias and sequencing errors. In this paper, we present SeqMaker, a modern NGS simulator capable of simulating different kinds of variations, with amplification bias and sequencing errors integrated. Target capture sequencing is supported through a capture panel description file; other characteristics, such as sequencing error rate, average duplication level, DNA template length distribution and quality distribution, can be easily configured with a simple JSON profile file. With the integration of sequencing errors and amplification bias, SeqMaker is able to simulate more realistic next generation sequencing data. The configurable variants and capture regions make SeqMaker very useful for generating data to train bioinformatics pipelines for applications such as somatic mutation calling.
Title: SeqMaker: A next generation sequencing simulator with variations, sequencing errors and amplification bias integrated
Authors: Shifu Chen, Yue Han, Lanting Guo, Jing-Shan Hu, Jia Gu
Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822634
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
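To illustrate how a JSON profile could drive a read simulator with sequencing errors, the sketch below samples reads from a reference and injects substitution errors. It is not SeqMaker's implementation or file format: the profile keys (`read_length`, `error_rate`, `seed`), the function name and the toy reference are all hypothetical, and variations, capture panels and amplification bias are not modelled.

```python
import json
import random

# Hypothetical profile; SeqMaker's real profile keys are not given in the abstract.
PROFILE_JSON = '{"read_length": 8, "error_rate": 0.05, "seed": 42}'


def simulate_reads(reference, n_reads, profile):
    """Sample fixed-length reads and inject random substitution errors."""
    rng = random.Random(profile["seed"])
    length = profile["read_length"]
    if length > len(reference):
        raise ValueError("read length exceeds reference length")
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - length + 1)
        bases = []
        for base in reference[start:start + length]:
            if rng.random() < profile["error_rate"]:
                # Substitute with a different base to mimic a sequencing error.
                base = rng.choice([b for b in "ACGT" if b != base])
            bases.append(base)
        reads.append((start, "".join(bases)))
    return reads


reference = "ACGTACGTACGTACGT"
profile = json.loads(PROFILE_JSON)
reads = simulate_reads(reference, n_reads=10, profile=profile)
```

Because the start position of every read is recorded, each simulated read carries its own ground truth, which is the property that makes simulated data useful for validating variant callers.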
Pub Date : 2016-12-01 DOI: 10.1109/BIBM.2016.7822601
Lihua Zhang, Rong Li, Qiuping Yang, Yanan Wu, Jingshan Huang, Bin Wu
Diabetic kidney disease (DKD) is a serious disease that presents a major health problem worldwide. There is a pressing need to explore novel biomarkers to facilitate early diagnosis and effective treatment in DKD patients, so as to prevent them from developing end-stage renal disease (ESRD). However, most regulation mechanisms at the genetic level in DKD remain unclear. In this work-in-progress paper, we describe our innovative methodologies that integrate biological, statistical, and computational approaches to investigate the important roles played by regulation among microRNAs (miRs), long non-coding RNAs (lncRNAs), and messenger RNAs (mRNAs) in DKD. We conducted a series of experiments and identified a list of miRs and lncRNAs as potential novel biomarkers, along with the set of target genes regulated by the discovered miRs. Our initial analysis results are promising for better understanding the regulation mechanisms of miRs and lncRNAs in the pathogenesis and progression of DKD.
Title: Innovative microRNA-lncRNA-mRNA co-expression analysis to understand the pathogenesis and progression of diabetic kidney disease
Venue: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)