Pub Date : 2010-11-01 DOI: 10.1109/GENSIPS.2010.5719691
A. Zollanvari, U. Braga-Neto, E. Dougherty
The validity of a classifier depends on the precision of the error estimator used to estimate its true error. This paper considers the necessary sample size to achieve a given validity measure, namely RMS, for resubstitution and leave-one-out error estimators in the context of LDA. It provides bounds for the RMS between the true error and both the resubstitution and leave-one-out error estimators in terms of sample size and dimensionality. These bounds can be used to determine the minimum sample size in order to obtain a desired estimation accuracy, relative to RMS. To show how these results can be used in practice, a microarray classification problem is presented.
Title: RMS bounds and sample size considerations for error estimation in linear discriminant analysis. Published in: 2010 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS).
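To make the RMS criterion in this abstract concrete, here is a minimal Monte Carlo sketch. It is not the paper's analytical bounds. It assumes two equally likely Gaussian classes with known identity covariance, so the linear discriminant reduces to a nearest-mean rule and the true error has a closed form. All function names and parameters (`rms_simulation`, `delta`, `n_per_class`) are hypothetical.

```python
import math
import random


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit(x0, x1):
    """Nearest-mean discriminant (assumed identity covariance): class 1 if w.x > c."""
    d = len(x0[0])
    m0 = [sum(r[j] for r in x0) / len(x0) for j in range(d)]
    m1 = [sum(r[j] for r in x1) / len(x1) for j in range(d)]
    w = [b - a for a, b in zip(m0, m1)]
    c = sum(wj * (a + b) / 2.0 for wj, a, b in zip(w, m0, m1))
    return w, c


def predict(w, c, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > c else 0


def true_error(w, c, mu0, mu1):
    """Exact error of the rule (w, c) under N(mu0, I) and N(mu1, I), equal priors."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    if norm == 0.0:
        return 0.5  # degenerate rule: every point is assigned to class 0
    p0 = sum(wj * m for wj, m in zip(w, mu0))
    p1 = sum(wj * m for wj, m in zip(w, mu1))
    return 0.5 * (phi((p0 - c) / norm) + phi((c - p1) / norm))


def resubstitution(x0, x1):
    """Fraction of training points misclassified by the rule trained on them."""
    w, c = fit(x0, x1)
    wrong = sum(predict(w, c, x) != 0 for x in x0)
    wrong += sum(predict(w, c, x) != 1 for x in x1)
    return wrong / (len(x0) + len(x1))


def leave_one_out(x0, x1):
    """Refit without each point in turn and test on the held-out point."""
    wrong = 0
    for i, x in enumerate(x0):
        w, c = fit(x0[:i] + x0[i + 1:], x1)
        wrong += predict(w, c, x) != 0
    for i, x in enumerate(x1):
        w, c = fit(x0, x1[:i] + x1[i + 1:])
        wrong += predict(w, c, x) != 1
    return wrong / (len(x0) + len(x1))


def rms_simulation(n_per_class, d=2, delta=2.0, trials=200, seed=0):
    """Return (RMS of resubstitution, RMS of leave-one-out) about the true error.

    Requires n_per_class >= 2 so that leave-one-out never empties a class.
    """
    rng = random.Random(seed)
    mu0 = [0.0] * d
    mu1 = [delta] + [0.0] * (d - 1)  # class means separated along the first axis
    sq_resub = sq_loo = 0.0
    for _ in range(trials):
        x0 = [[rng.gauss(m, 1.0) for m in mu0] for _ in range(n_per_class)]
        x1 = [[rng.gauss(m, 1.0) for m in mu1] for _ in range(n_per_class)]
        w, c = fit(x0, x1)
        err = true_error(w, c, mu0, mu1)
        sq_resub += (resubstitution(x0, x1) - err) ** 2
        sq_loo += (leave_one_out(x0, x1) - err) ** 2
    return math.sqrt(sq_resub / trials), math.sqrt(sq_loo / trials)
```

Calling `rms_simulation` over a grid of `n_per_class` values and picking the smallest one whose RMS falls below a target mimics, empirically, the sample-size question the abstract answers with bounds.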
Pub Date : 2010-11-01 DOI: 10.1109/GENSIPS.2010.5719688
Mohammadmahdi R. Yousefi, Jianping Hua, Chao Sima, E. Dougherty
When proposing a new classification scheme, perhaps in the form of a classification rule or feature selection method, modelers in the bioinformatics literature typically report its performance on data sets of interest, such as gene-expression microarrays. These data sets often include thousands of features but only a small number of sample points, which increases variability in feature selection and error estimation and makes the reported performances highly imprecise. This suggests that, if only the best results are reported, the reported performance of the proposed scheme will correlate poorly with the actual performance and be heavily biased relative to it. This paper confirms this in a large simulation study using both modeled and real data, which shows the behavior of the joint distributions of the minimum reported estimated errors and the corresponding true errors as functions of the number of samples tested.
Title: Effects of partial reporting of classification results. Published in: 2010 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS).
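The selection effect described in this abstract can be illustrated with a toy simulation. This is a sketch under invented assumptions and is not the study's actual design: each of `m` data sets gets a true error drawn uniformly from [0.2, 0.4], the estimated error is the misclassification rate on `n_test` independent test points, and only the smallest estimate is reported. The names `min_reported` and `mean_bias` are hypothetical.

```python
import random


def min_reported(true_errors, n_test, rng):
    """Return (smallest estimated error, true error of that same data set)."""
    best = None
    for eps in true_errors:
        # Estimated error: fraction of n_test points misclassified at rate eps.
        est = sum(rng.random() < eps for _ in range(n_test)) / n_test
        if best is None or est < best[0]:
            best = (est, eps)
    return best


def mean_bias(m, n_test=30, trials=2000, seed=0):
    """Average of (reported estimate - true error) when the best of m is reported."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        true_errors = [rng.uniform(0.2, 0.4) for _ in range(m)]
        est, eps = min_reported(true_errors, n_test, rng)
        total += est - eps
    return total / trials
```

With `m = 1` nothing is selected and the estimate is unbiased on average. As `m` grows, the reported minimum drifts further below the true error of the data set it came from, which is the optimistic bias the abstract refers to.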
Pub Date : 2010-11-01 DOI: 10.1109/GENSIPS.2010.5719675
Shreepriya Das, H. Vikalo, A. Hassibi
In this paper, we study the efficacy of a model-based base-calling approach for Illumina's sequencing platforms. In particular, we investigate Genome Analyzer I reads and provide a detailed biochemical model of the sequencing process, incorporating various non-idealities evident in such systems. Parameters of the model are estimated via supervised learning based on the particle swarm optimization technique. A computationally efficient sequential decoding method is proposed for base-calling. It is demonstrated that the performance of the proposed approach is comparable to that of Illumina's base-calling method.
Title: Model-based sequential base calling for Illumina sequencing. Published in: 2010 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS).
Pub Date : 2010-10-22 DOI: 10.1109/GENSIPS.2010.5719681
R. Pal, M. Caglar
Stochastic master equation (SME) models can provide a detailed representation of genetic regulatory systems, but their use is restricted by the large data requirements for parameter inference and the inherent computational complexity of their simulation. In this paper, we approximate the expected value of the output distribution of the SME by the output of a deterministic differential equation (DE) model. The mapping provides a technique to simulate the average behavior of the system in a computationally inexpensive manner and enables us to use existing tools for DE models to control the system. The effectiveness of the mapping and of the subsequent intervention policy design is evaluated through a biological example.
Title: Control of stochastic master equation models of genetic regulatory networks by approximating their average behavior. Published in: 2010 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS).
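The idea of replacing the SME's expected output with a deterministic DE can be seen in the simplest possible case. The sketch below is a hypothetical one-gene birth-death model (constant production rate `k`, degradation rate `gamma` per molecule) and is not the paper's mapping. For this linear system the mean of the master equation obeys dx/dt = k - gamma*x exactly, so the average of many stochastic runs should match the closed-form DE solution.

```python
import math
import random


def gillespie_final(k, gamma, x0, t_end, rng):
    """Simulate one stochastic trajectory and return the copy number at t_end."""
    t, x = 0.0, x0
    while True:
        total = k + gamma * x          # total event rate (production + degradation)
        t += rng.expovariate(total)    # exponential waiting time to the next event
        if t > t_end:
            return x
        if rng.random() * total < k:
            x += 1                     # production event
        else:
            x -= 1                     # degradation event (only possible when x > 0)


def ssa_mean(k, gamma, x0, t_end, runs=2000, seed=0):
    """Monte Carlo estimate of the expected copy number at t_end."""
    rng = random.Random(seed)
    return sum(gillespie_final(k, gamma, x0, t_end, rng) for _ in range(runs)) / runs


def ode_mean(k, gamma, x0, t):
    """Closed-form solution of dx/dt = k - gamma * x with x(0) = x0."""
    return k / gamma + (x0 - k / gamma) * math.exp(-gamma * t)
```

The DE evaluation is a single formula, while the stochastic estimate needs thousands of simulated trajectories. For nonlinear regulatory models the match is only approximate, and that approximation is what the abstract's mapping addresses.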
Pub Date : 2010-05-26 DOI: 10.1109/GENSIPS.2010.5719678
P. Lin, S. Khatri
The inference of gene predictors in the gene regulatory network (GRN) has become an important research area in the genomics and medical disciplines. Accurate predictors are necessary for constructing the GRN model and for enabling targeted biological experiments that attempt to validate or control the regulation process. In this paper, we implement a SAT-based algorithm to determine the gene predictor set from steady-state gene expression data (attractor states). Using the attractor states as input, the states are ordered into attractor cycles. For each attractor cycle ordering, all possible predictors are enumerated and a conjunctive normal form (CNF) expression is generated which encodes these predictors and their biological constraints. Each CNF is solved using a SAT solver to find candidate predictor sets. Statistical analysis of the resulting predictor sets selects the most likely predictor set of the GRN corresponding to the attractor data. We demonstrate our algorithm on attractor state data from a melanoma study [1] and present our predictor set results.
Title: Inference of gene predictor set using Boolean satisfiability. Published in: 2010 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS).
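The core consistency requirement behind the predictor search can be sketched without a SAT solver. The code below is an illustrative stand-in, not the paper's CNF encoding: it enumerates small candidate predictor sets by brute force and keeps those for which the target gene's next value is a well-defined Boolean function of the candidates' current values along one ordered attractor cycle. The function name, the `max_inputs` limit, and the example cycle are hypothetical.

```python
from itertools import combinations


def consistent_predictor_sets(cycle, target, max_inputs=2):
    """List gene-index tuples that can predict `target` along an attractor cycle.

    `cycle` is an ordered list of Boolean state tuples; the last state
    transitions back to the first.
    """
    n_genes = len(cycle[0])
    transitions = [(cycle[i], cycle[(i + 1) % len(cycle)]) for i in range(len(cycle))]
    found = []
    for k in range(1, max_inputs + 1):
        for genes in combinations(range(n_genes), k):
            table = {}  # partial truth table: predictor values -> next target value
            consistent = True
            for current, following in transitions:
                key = tuple(current[g] for g in genes)
                # The same input pattern must never map to two different outputs.
                if table.setdefault(key, following[target]) != following[target]:
                    consistent = False
                    break
            if consistent:
                found.append(genes)
    return found


# Hypothetical 3-gene cycle in which each gene copies its predecessor's value.
example_cycle = [(0, 0, 1), (1, 0, 0), (0, 1, 0)]
print(consistent_predictor_sets(example_cycle, target=0, max_inputs=1))  # [(2,)]
```

Brute-force enumeration grows combinatorially with the number of genes and inputs, which is the kind of search the abstract delegates to a SAT solver by encoding the predictors and their constraints as CNF.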