Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706607
Feng Yang, K. Mao
Feature ranking, which ranks features by their individual importance, is one of the most frequently used feature selection techniques. When applied to high-dimensional, small-sample gene expression data, traditional feature ranking criteria tend to produce inconsistent rankings even under slight perturbations of the training samples. A widely used strategy for resolving these inconsistencies is multi-criterion combination, but combining multiple criteria raises the problem of score normalization. In this paper, problems in existing methods are first analyzed, and a new gene importance transformation algorithm is then proposed. Experimental studies on three popular gene expression datasets show that multi-criterion combination based on the proposed score correction and normalization produces gene rankings with improved robustness.
Citation: Feng Yang, K. Mao, "Improving robustness of gene ranking by resampling and permutation based score correction and normalization," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706607.
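The abstract above does not spell out the proposed transformation, so the following is only a minimal sketch of the general idea of permutation-based score normalization followed by multi-criterion combination. The two criteria (`mean_difference`, `scaled_difference`), the averaging rule, the toy expression values and all function names are assumptions made for illustration; this is not the authors' algorithm.

```python
import random
import statistics


def split_by_class(values, labels):
    """Return the expression values of class 0 and class 1 separately."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    return a, b


def mean_difference(values, labels):
    """Criterion 1 (illustrative): absolute difference of the class means."""
    a, b = split_by_class(values, labels)
    return abs(statistics.mean(a) - statistics.mean(b))


def scaled_difference(values, labels):
    """Criterion 2 (illustrative): mean difference divided by the class spreads."""
    a, b = split_by_class(values, labels)
    spread = statistics.pstdev(a) + statistics.pstdev(b) + 1e-9
    return abs(statistics.mean(a) - statistics.mean(b)) / spread


def permutation_score(criterion, values, labels, n_perm, rng):
    """Fraction of label permutations whose score is below the observed one.

    The result lies in [0, 1] for every criterion, so scores from criteria
    with different raw scales become directly comparable.
    """
    observed = criterion(values, labels)
    shuffled = list(labels)
    wins = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if criterion(values, shuffled) < observed:
            wins += 1
    return wins / n_perm


def combined_ranking(expression, labels, criteria, n_perm=200, seed=0):
    """Average the normalized scores over all criteria and rank the genes."""
    rng = random.Random(seed)
    combined = {}
    for gene, values in expression.items():
        scores = [permutation_score(c, values, labels, n_perm, rng) for c in criteria]
        combined[gene] = sum(scores) / len(scores)
    ranking = sorted(combined, key=combined.get, reverse=True)
    return ranking, combined


# Toy data (made up): one gene separates the classes, the other does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expression = {
    "gene_noise": [1.0, 5.0, 1.1, 4.9, 5.1, 0.9, 5.2, 1.2],
    "gene_informative": [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1],
}
ranking, combined = combined_ranking(
    expression, labels, [mean_difference, scaled_difference]
)
print(ranking, combined)
```

Because each criterion is mapped onto the same [0, 1] scale before averaging, no single criterion dominates the combination simply because its raw scores are larger.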
Breast cancer is the most common malignant disease in women. A mammographic mass retrieval system can help radiologists improve diagnostic accuracy by retrieving biopsy-proven masses that are similar to the one being diagnosed. However, although screening mammography usually consists of two views (MLO and CC) of the same breast, most breast CAD systems that incorporate image retrieval techniques follow a single-view principle, in which the query ROI within a view is analyzed independently. In this paper, a mammographic mass retrieval approach based on multi-view information is proposed. The query example is a multi-view (MLO and CC) mass pair instead of the single-view mass used in the traditional image retrieval framework. In the experiments, several visual features are used for retrieval evaluation. Both distance-based similarity measures, such as Euclidean distance, and non-distance similarity measures based on a k-NN regression model are used for comparison. The experimental study was carried out on a database of 126 biopsy-proven masses (63 mass pairs). Preliminary results show that the multi-view retrieval approach achieves better retrieval accuracy than the single-view one, especially with the similarity metric based on the k-NN regression model.
Citation: Wei Liu, Weidong Xu, Lihua Li, Shuang Li, Huanping Zhao, Juan Zhang, "Improved mammographic mass retrieval performance using multi-view information," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706601.
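To make the multi-view query idea concrete, here is a small sketch in which each case stores one feature vector per view and the query is compared on both views at once. Summing the two Euclidean distances is an assumed fusion rule chosen for simplicity, and the feature values and field names are invented; the abstract does not state how the views are actually combined.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def single_view_distance(query, case, view):
    """Distance using one view only (the traditional single-view setting)."""
    return euclidean(query[view], case[view])


def multi_view_distance(query, case):
    """Assumed fusion rule: sum of the MLO and CC distances."""
    return single_view_distance(query, case, "MLO") + single_view_distance(query, case, "CC")


def retrieve(query, database, k=2):
    """Return the ids of the k cases closest to the query mass pair."""
    ranked = sorted(database, key=lambda case: multi_view_distance(query, case))
    return [case["id"] for case in ranked[:k]]


# Hypothetical two-dimensional features per view.
database = [
    {"id": "A", "MLO": [0.1, 0.9], "CC": [0.2, 0.8]},
    {"id": "B", "MLO": [0.1, 0.9], "CC": [0.9, 0.1]},
    {"id": "C", "MLO": [0.8, 0.2], "CC": [0.7, 0.3]},
]
query = {"MLO": [0.1, 0.9], "CC": [0.25, 0.8]}
top = retrieve(query, database, k=2)
print(top)
```

In this toy database, cases A and B are indistinguishable from the MLO view alone; only the CC view separates them, which is the kind of ambiguity a mass-pair query is meant to resolve.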
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706542
Zhengkui Wang, Yue Wang, K. Tan, L. Wong, D. Agrawal
The 1000 Genomes Project has made available a large number of single nucleotide polymorphisms (SNPs) for genome-wide association studies (GWAS). However, the large number of SNPs has also made the discovery of epistatic interactions between SNPs computationally expensive. Parallelizing the computation offers a promising solution. In this paper, we propose a cloud-based epistasis computing (CEO) model that efficiently examines all k-locus SNP combinations to find statistically significant epistatic interactions. Our CEO model uses the MapReduce framework, which can be executed either on a user's own cluster or in a cloud environment. Our cloud-based solution offers elastic computing resources to users and, more importantly, makes our approach affordable and available to all end-users. We evaluate our CEO model on a cluster of more than 40 nodes. Our experimental results show that the CEO model is computationally flexible, scalable and practical.
Citation: Zhengkui Wang, Yue Wang, K. Tan, L. Wong, D. Agrawal, "CEO a cloud epistasis computing model in GWAS," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706542.
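The map and reduce steps for an exhaustive k-locus scan can be sketched in a single process as below. This only illustrates the MapReduce pattern, not the CEO implementation: the key layout, the Pearson chi-square statistic, the binary genotype coding and the toy samples are all assumptions, and a real deployment would distribute the two phases across worker nodes.

```python
from collections import Counter
from itertools import combinations


def map_phase(samples, k):
    """Emit ((snp_indices, genotypes, phenotype), 1) for every k-locus combination."""
    for genotypes, phenotype in samples:
        for combo in combinations(range(len(genotypes)), k):
            yield (combo, tuple(genotypes[i] for i in combo), phenotype), 1


def reduce_phase(pairs):
    """Sum the emitted counts per key, as a MapReduce reducer would."""
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


def chi_square(counts, combo):
    """Pearson chi-square of joint genotype versus phenotype for one combination."""
    table = {(g, p): n for (c, g, p), n in counts.items() if c == combo}
    total = sum(table.values())
    row, col = Counter(), Counter()
    for (g, p), n in table.items():
        row[g] += n
        col[p] += n
    stat = 0.0
    for g in row:
        for p in col:
            expected = row[g] * col[p] / total
            stat += (table.get((g, p), 0) - expected) ** 2 / expected
    return stat


# Toy data (made up): the phenotype is the XOR of SNP 0 and SNP 1, so neither
# SNP is informative alone but the pair is; SNP 2 is unrelated.
samples = [
    ((0, 0, 0), 0), ((0, 1, 1), 1), ((1, 0, 0), 1), ((1, 1, 1), 0),
    ((0, 0, 1), 0), ((0, 1, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 0),
]
counts = reduce_phase(map_phase(samples, k=2))
scores = {combo: chi_square(counts, combo) for combo in combinations(range(3), 2)}
best = max(scores, key=scores.get)
print(best, scores)
```

The number of emitted keys grows with the number of k-locus combinations, which is why the workload parallelizes naturally: each mapper can handle a slice of the samples and each reducer a slice of the combinations.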
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706560
Nan Zhao, Bin Pang, C. Shyu, Dmitry Korkin
The progress in experimental and computational structural biology has led to rapid growth in the number of experimentally resolved structures and computational models of protein-protein interactions. However, distinguishing between physiological and non-physiological interactions remains a challenging problem. In this work, two related problems of interface classification are addressed. The first concerns the classification of physiological and crystal-packing interactions. The second deals with the classification of physiological interactions, or their accurate models, and decoys obtained from inaccurate docking models. We defined a universal set of interface features and employed supervised and semi-supervised learning approaches to accurately classify the interactions in both problems. Furthermore, we formulated the second problem as a semi-supervised learning problem and employed a transductive SVM to improve the classification accuracy. Finally, we showed that the scoring functions from the obtained classifiers can improve the accuracy of docking methods.
Citation: Nan Zhao, Bin Pang, C. Shyu, Dmitry Korkin, "An accurate classification of native and non-native protein-protein interactions using supervised and semi-supervised learning approaches," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706560.
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706536
Roberto Blanco
Computational phylogenetics has historically neglected strict theoretical approaches that exploit the mathematical models beneath which it abstracts away the nuances of evolution. In particular, parsimony is conceptually simple and amenable to rigorous treatment, and has a clear analogue in graph theory, the Steiner tree. We present and refine the notion of sequence space as the soil from which all graph-theoretical methods arise, studying its structural properties and complexity with an eye on maximum parsimony. We therefrom introduce a basic set of very efficient implicit reductions that discard information with a fixed effect on the optimality of the solution, and show how it can be applied to large, real datasets.
Citation: Roberto Blanco, "Structural parsimony: Reductions in sequence space," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706536.
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706637
Jiarui Ding, Sohrab P. Shah
As an extension of hidden Markov models, hidden semi-Markov models allow the state duration (the time spent in the same state) to follow a general distribution. With appropriate state duration distributions, hidden semi-Markov models are therefore well suited to modeling sequences that consist of successions of homogeneous zones. Hidden semi-Markov models are generative models and are usually trained by maximum likelihood estimation. To compensate for model misspecification and provide protection against outliers, hidden semi-Markov models can be trained discriminatively given a labeled training set, at the expense of increased training complexity. As an alternative to discriminative training, in this paper we address model misspecification and outliers by adopting robust methods. Specifically, we use Student's t mixture models as the emission distributions of hidden semi-Markov models. The proposed robust hidden semi-Markov models are used to model array-based comparative genomic hybridization data. Experiments conducted on the benchmark data from the Coriell cell lines and on the glioblastoma multiforme data illustrate the reliability of the technique.
Citation: Jiarui Ding, Sohrab P. Shah, "Robust hidden semi-Markov modeling of array CGH data," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706637.
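The robustness argument for Student's t emissions can be illustrated with the standard density formulas alone. The sketch below compares how strongly a Gaussian and a Student's t emission penalize an outlying observation; the parameter values (unit scale, 3 degrees of freedom, an outlier at 6.0) are arbitrary choices for illustration, and the hidden semi-Markov machinery itself is not shown.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    """Log density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z


def student_t_logpdf(x, mu, sigma, nu):
    """Log density of a location-scale Student's t with nu degrees of freedom."""
    z = (x - mu) / sigma
    return (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - math.log(sigma)
        - (nu + 1.0) / 2.0 * math.log1p(z * z / nu)
    )


def t_mixture_logpdf(x, components):
    """Log density of a t mixture; components are (weight, mu, sigma, nu) tuples."""
    terms = [math.log(w) + student_t_logpdf(x, mu, s, nu) for w, mu, s, nu in components]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


typical, outlier = 0.2, 6.0
# Log-likelihood lost when a typical observation is replaced by the outlier.
gap_gaussian = gaussian_logpdf(typical, 0.0, 1.0) - gaussian_logpdf(outlier, 0.0, 1.0)
gap_student = student_t_logpdf(typical, 0.0, 1.0, 3.0) - student_t_logpdf(outlier, 0.0, 1.0, 3.0)
print(gap_gaussian, gap_student)
```

The Gaussian penalty grows quadratically with the distance from the mean, whereas the t penalty grows only logarithmically, so a single aberrant probe has much less influence on which state is inferred.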
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706548
M. Mak, Wei Wang, S. Kung
We have recently found that the computation time of homology-based subcellular localization can be substantially reduced by aligning profiles up to the cleavage site positions of signal peptides, mitochondrial targeting peptides, and chloroplast transit peptides [1]. While the method can reduce the profile alignment time by as much as 20-fold, it cannot reduce the computation time spent on creating the profiles. In this paper, we propose a new approach that reduces both the profile creation time and the profile alignment time. In the new approach, instead of cutting the profiles, we shorten the sequences by cutting them at the cleavage site locations. The shortened sequences are then presented to PSI-BLAST to compute the profiles. Experimental results and analysis of profile-alignment score matrices suggest that both profile creation time and profile alignment time can be reduced without sacrificing subcellular localization accuracy. Once a pairwise profile-alignment score matrix has been obtained, a one-vs-rest SVM classifier can be trained. To further reduce the training and recognition time of the classifier, we propose a perturbation discriminant analysis (PDA) technique. PDA was found to have a shorter training time than the conventional SVM.
Citation: M. Mak, Wei Wang, S. Kung, "Truncation of protein sequences for fast profile alignment with application to subcellular localization," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706548.
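The truncation step can be sketched as follows. The sketch assumes that the N-terminal segment up to the cleavage site is the part that is kept (in line with the earlier "aligning profiles up to the cleavage site" idea) and that the cleavage positions come from some external predictor; the sequences, positions and function names are invented. The cell count is only the usual length-product estimate for dynamic-programming alignment, not a measured timing.

```python
def truncate_at_cleavage_site(sequence, cleavage_site):
    """Keep residues 1..cleavage_site (1-based count).

    The sequence is returned unchanged when no site is known or when the
    site lies beyond the end of the sequence.
    """
    if cleavage_site is None or cleavage_site >= len(sequence):
        return sequence
    return sequence[:cleavage_site]


def alignment_cells(length_a, length_b):
    """Number of dynamic-programming cells filled by one pairwise alignment."""
    return length_a * length_b


# Made-up sequences and hypothetical predicted cleavage positions.
sequences = {
    "protein_a": "M" + "A" * 24 + "K" * 375,  # 400 residues
    "protein_b": "M" + "L" * 19 + "S" * 280,  # 300 residues
    "protein_c": "MKV" * 50,                  # 150 residues, no predicted site
}
cleavage_sites = {"protein_a": 25, "protein_b": 20, "protein_c": None}

truncated = {
    name: truncate_at_cleavage_site(seq, cleavage_sites[name])
    for name, seq in sequences.items()
}
full_cost = alignment_cells(len(sequences["protein_a"]), len(sequences["protein_b"]))
short_cost = alignment_cells(len(truncated["protein_a"]), len(truncated["protein_b"]))
print(full_cost, short_cost)
```

Because the shortened sequences would be the input to profile construction as well, both the profile creation and the later pairwise alignment operate on fewer residues.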
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706535
M. Gromiha, N. Saranya, S. Selvaraj, B. Jayaram, K. Fukui
We have developed an energy-based approach for identifying the binding site residues in protein-protein complexes. The binding site residues have been analyzed with sequence- and structure-based parameters such as neighboring residues in the vicinity of binding sites and conformational switching. We observed specific preferences of dipeptides and tripeptides for binding, which are unique to protein-protein complexes. Our analysis showed that 7% of residues changed their conformation upon protein-protein complex formation; the fraction is 9.2% in binding sites and 6.6% in non-binding sites. Specifically, the residues Glu, Lys, Leu and Ser changed their conformation from coil to helix/strand and from helix to coil/strand. Leu, Ser, Thr and Val prefer to change their conformation from strand to coil/helix. The results obtained in this study will be helpful for understanding and predicting the binding sites in protein-protein complexes.
Citation: M. Gromiha, N. Saranya, S. Selvaraj, B. Jayaram, K. Fukui, "Sequence and structural features of binding site residues in protein-protein complexes," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706535.
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706644
Myungha Jang, A. Rhie, Hyun Seok Park
Graphical layout techniques play a vital part in systems biology by enhancing the understanding and visualization of chemical reaction pathways in the body. Metabolic networks have particularly complex binding structures, which makes their graphical representation challenging to comprehend. For legibility, reducing graph complexity in metabolic networks is crucial when working with a large number of nodes and edges. This paper introduces a node abstraction algorithm that treats metabolic pathways as hierarchical networks and considers reactions between compound pairs (the equivalent of node pairs in the context of biological networks) as an elastic parameter for automated reaction compression. Substrates and products that locally compose reactions with low connectivity were reduced, and cyclical or hierarchical pathways were aligned according to their structural composition.
Citation: Myungha Jang, A. Rhie, Hyun Seok Park, "Toward automatically drawn metabolic pathway atlas with peripheral node abstraction algorithm," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706644.
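One simple way to reduce low-connectivity substrates and products is to fold every node whose degree is at or below a threshold into its better-connected neighbour. The sketch below does exactly that on an undirected edge list; the degree threshold, the folding rule and the placeholder compound names are assumptions for illustration and are not taken from the paper's algorithm.

```python
from collections import defaultdict


def abstract_peripheral_nodes(edges, max_degree=1):
    """Remove nodes with degree <= max_degree and record where they were attached.

    Returns the remaining core edges and a mapping from each core node to the
    peripheral nodes that were folded into it.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    peripheral = {n for n, nbrs in adjacency.items() if len(nbrs) <= max_degree}
    core_edges = [(u, v) for u, v in edges if u not in peripheral and v not in peripheral]

    folded = defaultdict(list)
    for node in sorted(peripheral):
        for neighbour in sorted(adjacency[node]):
            if neighbour not in peripheral:
                folded[neighbour].append(node)
    return core_edges, dict(folded)


# Placeholder network: a three-compound cycle with three side compounds.
edges = [
    ("A", "B"), ("B", "C"), ("C", "A"),
    ("A", "x1"), ("A", "x2"), ("C", "x3"),
]
core_edges, folded = abstract_peripheral_nodes(edges)
print(core_edges, folded)
```

A layout routine would then draw only `core_edges` and render each folded group as a single collapsed annotation next to its core node, which lowers the number of nodes and edges competing for space.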
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706614
Chao Zhang, Shunfu Xu, Dong Xu
As a marker of Helicobacter pylori, cytotoxin-associated gene A (CagA) has been shown to be the major virulence factor causing gastroduodenal diseases. However, the molecular mechanisms that underlie the development of different gastroduodenal diseases caused by cagA-positive H. pylori infection remain unknown. Current studies are mainly limited to the relationship between EPIYA motifs in the CagA strain and diseases, but such a relationship is insufficient to explain the diversity of diseases. We propose a new and systematic method to analyze the relationship between whole CagA sequence patterns and diseases. For this purpose, we introduced an entropy calculation to detect key residues of CagA as gastric cancer biomarkers, and then employed a supervised learning procedure that uses the key residues to classify CagA strains as cancer related or non-cancer related. We achieved 76% and 71% classification accuracy for the Western and East Asian subtypes, respectively. Our study may help establish H. pylori biomarkers for predicting gastroduodenal disease outcome.
Citation: Chao Zhang, Shunfu Xu, Dong Xu, "Detection and application of CagA sequence markers for assessing risk factor of gastric cancer caused by Helicobacter pylori," 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2010. DOI: 10.1109/BIBM.2010.5706614.
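The abstract does not give the exact entropy calculation, so the sketch below uses one plausible variant: for each alignment column, the Shannon entropy of the pooled residues minus the size-weighted entropy within each disease group (an information gain). Columns with a high gain are conserved within groups but differ between them. The aligned toy sequences, the 0.5-bit threshold and the function names are assumptions and do not represent CagA data.

```python
import math
from collections import Counter


def entropy(residues):
    """Shannon entropy in bits of the residue distribution in one column."""
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def column_information_gain(group_a, group_b):
    """Per-column entropy drop when aligned sequences are split by disease label."""
    total = len(group_a) + len(group_b)
    gains = []
    for col_a, col_b in zip(zip(*group_a), zip(*group_b)):
        pooled = entropy(col_a + col_b)
        within = (len(col_a) * entropy(col_a) + len(col_b) * entropy(col_b)) / total
        gains.append(pooled - within)
    return gains


# Made-up, pre-aligned sequences of equal length for the two groups.
cancer = ["MKTA", "MKSA", "MKTA"]
non_cancer = ["MRTA", "MRSA", "MRTA"]

gains = column_information_gain(cancer, non_cancer)
key_columns = [i for i, g in enumerate(gains) if g > 0.5]
print(gains, key_columns)
```

In the toy alignment, the third column varies (T or S) but in the same way in both groups, so it carries no label information, whereas the second column (K versus R) separates the groups completely. The residues at the selected columns could then serve as input features for a supervised classifier.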