Classification of cohesin family using class specific motifs
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-11-14 DOI: 10.1109/HIBIT.2013.6661687
Ercument M. Eser, B. Arslan, U. Sezerman
Motif extraction from protein sequences has been a challenging task for bioinformaticians. Class-specific motifs, which occur frequently in one class but only rarely in the others, can be used for highly accurate classification of protein sequences. In this study, we present a new scoring-based method for class-specific n-gram motif selection using reduced amino acid alphabets. Cohesin protein sequences, which interact with Dockerin modules to assemble the cellulosome, the multi-enzyme complex that degrades cellulose (the most common and abundant organic polymer), are used for class-specific motif selection, and the selected motifs are then given to the J48 and SVM algorithms as features. Classification results are examined across n-gram sizes, reduced amino acid alphabets, and numbers of features. The best result, a training accuracy of 98.61% and a test accuracy of 94.54%, was obtained with the Gbmr14 alphabet, 5 features per family, 4-gram motifs, and the J48 algorithm. The proposed technique can be generalized to other protein families.
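
The abstract does not give the scoring function, so the following is only a minimal sketch of the general idea of class-specific n-gram selection over a reduced alphabet. The four-group alphabet (`TOY_GROUPS`) is a hypothetical stand-in and is not Gbmr14, and the score (fraction of sequences containing the n-gram in the class, minus the highest such fraction in any other class) is an assumption made for illustration.

```python
from collections import Counter

# Hypothetical 4-symbol reduced alphabet (NOT Gbmr14): each group of
# amino acids is collapsed to a single symbol.
TOY_GROUPS = {"h": "AVLIMFWC", "p": "STNQYGP", "+": "KRH", "-": "DE"}
REDUCE = {aa: sym for sym, aas in TOY_GROUPS.items() for aa in aas}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in seq)


def ngrams(seq, n):
    """Return the set of distinct n-grams occurring in seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def motif_scores(classes, n):
    """Score each n-gram per class (assumed score, see lead-in).

    classes maps a class label to a list of protein sequences.
    """
    presence = {}
    for label, seqs in classes.items():
        counts = Counter()
        for s in seqs:
            counts.update(ngrams(reduce_sequence(s), n))
        # Fraction of the class's sequences that contain each n-gram.
        presence[label] = {g: c / len(seqs) for g, c in counts.items()}

    scores = {}
    for label, freq in presence.items():
        others = [presence[o] for o in presence if o != label]
        scores[label] = {
            g: f - max((o.get(g, 0.0) for o in others), default=0.0)
            for g, f in freq.items()
        }
    return scores


def top_motifs(classes, n, k):
    """Pick the k best-scoring n-grams per class (ties broken alphabetically)."""
    scores = motif_scores(classes, n)
    return {
        label: sorted(sc, key=lambda g: (-sc[g], g))[:k]
        for label, sc in scores.items()
    }
```

The selected motifs could then be turned into binary presence/absence features per sequence and passed to any decision-tree or SVM implementation; that step is left out here.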
Sampling bias in microarray data analysis: A demonstration in the field of reproductive biology
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-11-14 DOI: 10.1109/HIBIT.2013.6661684
S. Manafi, A. Uyar, A. Bener
The actual benefit of high-throughput microarray experiments relies strongly on eliminating all possible sources of bias during both the experimental procedure and the data analysis process. Within reproductive biology, microarray-based transcriptomic analysis of the oocyte and the surrounding cumulus/granulosa cells poses significant challenges because of limited sample amounts and/or potential contamination from adjacent cells. In this study, we investigated the effect of sampling bias on the consistency of microarray differential expression analysis in the field of reproduction. Experiments were conducted on five datasets obtained from publicly available microarray repositories. For each dataset, probe-level expression values were extracted, and background adjustment, inter-array quantile normalization, and probe set summarization were performed according to the Robust Multi-Chip Average (RMA) algorithm. Genes with a false discovery rate-corrected p value < 0.05 and |fold change| > 2 were considered differentially expressed. Results demonstrate that both the number of replicates and the choice of which subset of the available samples to include in the analysis alter the number of differentially expressed genes. We suggest that assessment of inter-sample variance prior to differential expression analysis is an important step in microarray experiments, and that proper handling of that variance may require alternative normalization and/or statistical test methods.
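
The selection rule in the abstract (FDR-corrected p < 0.05 and |fold change| > 2) can be sketched as follows. This is a minimal illustration, assuming the Benjamini-Hochberg procedure for the FDR correction (the abstract does not name the procedure) and assuming fold changes are supplied on the log2 scale, so that a two-fold change corresponds to |log2 FC| > 1. The function names are hypothetical.

```python
import math


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def differentially_expressed(pvals, log2_fc, alpha=0.05, min_fold=2.0):
    """Flag genes passing both the FDR and the fold-change threshold.

    log2_fc holds log2 fold changes (an assumption of this sketch).
    """
    q = benjamini_hochberg(pvals)
    cutoff = math.log2(min_fold)
    return [qi < alpha and abs(fc) > cutoff for qi, fc in zip(q, log2_fc)]
```

Re-running `differentially_expressed` on different subsets of replicates is one simple way to see the effect the abstract reports: the list of flagged genes can change with the samples included.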
Ranking tandem mass spectra: And the impact of database size and scoring function on peptide spectrum matches
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-11-14 DOI: 10.1109/HIBIT.2013.6661686
Canan Has, Cemal Ulas Kundakci, Aybuge Altay, J. Allmer
Proteomics is currently driven by mass spectrometry. Many computational algorithms have been proposed for the analysis of tandem mass spectra. There are two approaches: one assigns a peptide sequence to a tandem mass spectrum directly, and the other employs a sequence database to look up possible solutions. The former needs high-quality spectra, while the latter can tolerate lower-quality spectra. Since both methods are computationally expensive, it is sensible to establish spectral quality first using an independent, fast algorithm. In this study, we first establish proper settings for database search algorithms for the analysis of spectra in our gold benchmark dataset and then analyze the performance of ScanRanker, an algorithm for quality assessment of tandem MS spectra, on this ground truth data. We found that OMSSA and MSGFDB have limitations in their scoring functions but were able to form a proper consensus prediction using majority vote for our benchmark data. Unfortunately, ScanRanker's results do not correlate well with the consensus, and ScanRanker is also too slow to be used in its intended capacity.
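
A majority-vote consensus over search-engine results, as mentioned in the abstract, might look like the sketch below. The abstract does not describe how votes were collected or how ties were handled, so the strict-majority rule and the third engine name (`engine_c`) used in the usage example are assumptions.

```python
from collections import Counter


def consensus_peptide(assignments):
    """Return the peptide chosen by a strict majority of engines, else None.

    assignments maps an engine name to the peptide it assigned to one
    spectrum, or None if that engine reported no match.
    """
    votes = Counter(p for p in assignments.values() if p is not None)
    if not votes:
        return None
    peptide, count = votes.most_common(1)[0]
    # Require more than half of all engines, counting non-matches as abstentions.
    return peptide if count > len(assignments) / 2 else None


# Usage with one hypothetical spectrum:
example = {"OMSSA": "PEPTIDEK", "MSGFDB": "PEPTIDEK", "engine_c": "PEPTLDEK"}
winner = consensus_peptide(example)
```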
Determination of the exact value of the regularization parameter in smoothness priors method with respect to the corresponding cut-off frequencies for designing filters
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-11-14 DOI: 10.1109/HIBIT.2013.6661683
Y. Isler
A filter is an electrical network or piece of software that alters the amplitude and/or phase characteristics of a signal with respect to frequency. Recently, a new detrending method has been presented to remove slow nonstationary trends from biomedical signals; it is equivalent to a high-pass filter that removes very low frequency components from the given signal. Although many recently published papers on the analysis of biomedical signals, such as the heart rate variability signal, have used the smoothness priors detrending method, no exact relationship has been given between the regularization parameter and the cut-off frequency of the corresponding high-pass filter. In this study, we present this relationship as an empirical formula that allows researchers to calculate the parameter from the desired frequency response, not only for a high-pass filter but also for other filter types.
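
For orientation, the smoothness priors detrending operator is usually written as below. This is the formulation commonly used for heart rate variability signals and is given here as background; the symbols are assumptions of this note, and the empirical formula relating the regularization parameter to the cut-off frequency is not stated in the abstract and is not reproduced.

```latex
% z: observed signal of length N, \lambda: regularization parameter,
% D_2: (N-2) x N second-order difference matrix with rows (1, -2, 1).
\begin{aligned}
\hat{z}_{\mathrm{trend}} &= \left(I + \lambda^{2} D_2^{\mathsf{T}} D_2\right)^{-1} z, \\
\hat{z}_{\mathrm{stat}}  &= z - \hat{z}_{\mathrm{trend}}
  = \left(I - \left(I + \lambda^{2} D_2^{\mathsf{T}} D_2\right)^{-1}\right) z .
\end{aligned}
```

Increasing the regularization parameter makes the estimated trend smoother, so less low-frequency content is removed from the detrended signal; this is the sense in which the parameter sets the cut-off frequency of the equivalent high-pass filter.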
Period-doubling route to chaos in shunting inhibitory cellular neural networks
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-11-14 DOI: 10.1109/HIBIT.2013.6661682
M. Akhmet, M. O. Fen
In this study, we investigate the dynamics of shunting inhibitory cellular neural networks with external inputs in the form of relay functions. The presence of chaos through a period-doubling cascade is proved theoretically. An example that confirms the theoretical results is provided.
A genetic algorithm approach to active subnetwork search applied to GWAS data
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-11-14 DOI: 10.1109/HIBIT.2013.6661681
Ozan Ozisik, Burcu Bakir-Gungor, B. Diri, O. U. Sezerman
An active subnetwork is a group of interconnected genes that show condition-specific differences. It has been observed that gene products with alterations associated with a disease of interest tend to be part of subnetworks within the overall interaction network. Hence, integrating the interaction data with the genotypic data underlying disease states helps separate the subnetworks perturbed in a given disorder from the rest of the network. In the literature, active subnetwork search is used to discover disease-related regulatory pathways, dysregulated genes, functional modules, and cancer markers, to classify diseases, and to predict response to treatment. In this study, a genetic algorithm-based method is developed for active subnetwork search and applied to the WTCCC Rheumatoid Arthritis (RA) genome-wide association study dataset. The relevance of the identified subnetworks to the disease is assessed in terms of biological pathways. Our results show that the proposed method works well in detecting significant RA-associated subnetworks and that it is also applicable to identifying subnetworks of other complex diseases.
Development of a social support intervention with a network of caregivers to find wandering Alzheimer's patients as soon as possible: A social computing application in healthcare
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-11-14 DOI: 10.1109/HIBIT.2013.6661678
Y. Yuce, K. H. Gulkesen
Locating and securing an Alzheimer's disease (AD) patient who is outdoors and wandering is crucial to the patient's safety. Although advances in geotracking and mobile technology have made it possible to locate patients instantly, reaching them while they are wandering may still take time. However, a social network of caregivers may help shorten the time it takes to reach and secure a wandering AD patient. This study proposes a social computing application in healthcare, designed to form and direct a social support network of caregivers for locating and securing wandering AD patients as soon as possible. The proposed system consists of three major components: a tracking device, a middleware, and a mobile application. The tracking device carries a Subscriber Identity Module for the Global System for Mobile Communications (GSM) network and is responsible for communication between an AD patient and the system (e.g., transmission of location updates at varying intervals). The middleware employs a supervision mechanism to detect potentially wandering patients, a tracking mechanism to locate a wandering patient, and a coordination mechanism to communicate with caregivers and direct them to the wandering patient. The mobile application mediates the interaction between a caregiver and the system during a wandering patient search session (e.g., the communication steps needed to join the search). The communication backbone of the system involves the Internet and a GSM network. The major system component, the middleware, is being implemented in Java. Family caregivers will be interviewed before and after using the system. To assess the impact of the system in terms of depression, anxiety, and burden, the Center for Epidemiologic Studies Depression Scale, the Patient Health Questionnaire, and the Zarit Burden Interview, respectively, will be administered during these interviews.
A dynamic Bayesian framework to learn temporal gene interactions using external knowledge
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-11-14 DOI: 10.1109/HIBIT.2013.6661680
U. Agyuz, S. Isci, C. Ozturk, A. Ademoglu, H. Otu
One of the main problems in systems biology is learning gene interaction networks from experimental data. This is a challenging task because the experimental data are sparse and noisy, and network learning algorithms are computationally intensive. Bayesian Networks (BNs) have become a popular choice for learning such networks because they avoid overfitting and are robust to noise. In this paper, we build upon our established framework, Bayesian Network Prior, in which we incorporate existing biological knowledge when learning gene interaction networks. However, biological phenomena are time-dependent, and there is a need to extend the static structure of learning approaches to a temporal level. Here, we present a Dynamic BN (DBN) framework that learns interaction networks between different time points in time-series data. Both intra and inter networks are learnt and compared to those of standard DBN learning algorithms. Our results, based on synthetic and simulated gene expression data, suggest that the proposed method outperforms existing approaches in identifying the underlying network structure. The proposed framework is robust to errors in the incorporated knowledge and can combine various experimental data types with existing knowledge when learning networks.
Use of open linked data in bioinformatics space: A case study
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-11-14 DOI: 10.1109/HIBIT.2013.6661679
R. Çelebi, Ozgur Gumus, Yeşim AYDIN SON
In the life sciences, the semantic web can support many aspects of bio- and health informatics, with exciting applications appearing in areas ranging from plant genetics to drug discovery. Using semantic technologies with open linked data provides two kinds of advantages: the ability to search multiple datasets through a single framework, and the ability to search relationships and paths of relationships that span different datasets. The Bio2RDF project creates a network of coherently linked data across biological databases. As part of the Bio2RDF project, an integrated bioinformatics warehouse on the semantic web has been built. In this paper, we define a use case with a query over multiple distant data sources that are semantically available through Bio2RDF. The validation of the results by traditional search techniques and a discussion of future directions are presented.
Data mining for microRNA gene prediction: On the impact of class imbalance and feature number for microRNA gene prediction
2013 8th International Symposium on Health Informatics and Bioinformatics
Pub Date: 2013-09-01 DOI: 10.1109/HIBIT.2013.6661685
Muserref Duygu Saçar, J. Allmer
MicroRNAs (miRNAs) are small, non-coding RNAs that are involved in the posttranscriptional modulation of gene expression. Their short (18-24 nucleotide), single-stranded mature sequences are involved in targeting specific genes. Experimental methods are limited, and it is difficult, if not impossible, to establish all miRNAs and their targets experimentally. Therefore, many tools for the prediction of miRNA genes and miRNA targets have been proposed. Most of these tools are based on machine learning methods, and within that area mostly two-class classification is employed. Unfortunately, truly negative data are impossible to obtain, and only approximations of negative data are currently available. Also, we recently showed that the available positive data are not flawless. Here we investigate the impact of class imbalance on learner accuracy and find a difference of up to 50% between the best and worst precision and recall values. In addition, we looked at an increasing number of features and found a curve that peaks at 0.97 recall and 0.91 precision, with performance decaying quickly once more than 100 features are included.
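
To make the class-imbalance effect concrete, the sketch below computes precision and recall from confusion counts and shows how precision alone shifts when the negative class grows while the classifier's per-class error rates stay fixed. The rates and class sizes in the usage lines are made-up illustrative values, not results from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def expected_precision(tpr, fpr, n_pos, n_neg):
    """Precision implied by a fixed true-positive and false-positive rate.

    Recall equals tpr regardless of class sizes; precision depends on them.
    """
    tp = tpr * n_pos
    fp = fpr * n_neg
    return tp / (tp + fp)


# Same classifier (90% TPR, 10% FPR), two hypothetical class ratios.
balanced = expected_precision(0.9, 0.1, n_pos=100, n_neg=100)
imbalanced = expected_precision(0.9, 0.1, n_pos=100, n_neg=1000)
```

With these illustrative numbers, precision falls from 0.9 at a 1:1 ratio to roughly 0.47 at a 1:10 ratio, while recall stays at 0.9.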