International Journal of Data Mining and Bioinformatics: Latest Articles

A fast Boyer-Moore type pattern matching algorithm for highly similar sequences
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-09-01 DOI: 10.1504/IJDMB.2015.072101
Nadia Ben Nsira, T. Lecroq, M. Elloumi
In the last decade, biology and medicine have undergone a fundamental change: next generation sequencing (NGS) technologies have made it possible to obtain genomic sequences very quickly and at low cost compared with the traditional Sanger method. These NGS technologies have thus made it possible to collect genomic sequences (genes, exomes or even full genomes) of individuals of the same species. Such sequences are more than 99% identical. There is thus a strong need for efficient algorithms for indexing and performing fast pattern matching in such sets of sequences. In this paper we propose a very efficient algorithm that solves the exact pattern matching problem in a set of highly similar DNA sequences where only the pattern can be pre-processed. This new algorithm extends variants of the Boyer-Moore exact string matching algorithm. Experimental results show that it exhibits the best performance in practice.
Volume (as listed): 45 1; Pages: 266-88
Citations: 7
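The abstract above builds on Boyer-Moore style matching in which only the pattern is pre-processed. As a rough illustration of that family, the sketch below uses the classic Horspool simplification (a bad-character shift table only) and applies it independently to each sequence in a collection. It is not the paper's algorithm, which exploits the similarity between sequences; the function names `horspool_search` and `search_collection` and the sample sequences are hypothetical.

```python
def horspool_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every exact occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Pre-process the pattern only: for each character (except the last
    # pattern position), store its distance to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text character aligned with the last
        # pattern position; unseen characters allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits


def search_collection(sequences: dict[str, str], pattern: str) -> dict[str, list[int]]:
    """Naive baseline: search each sequence independently (hypothetical helper)."""
    return {name: horspool_search(seq, pattern) for name, seq in sequences.items()}


if __name__ == "__main__":
    genomes = {"ref": "GATTACAGATTACA", "variant": "GATTACAGATAACA"}
    print(search_collection(genomes, "TTAC"))
```

Searching each sequence separately ignores the fact that the sequences are almost identical, which is exactly the redundancy the paper sets out to exploit.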
Cuckoo search optimisation for feature selection in cancer classification: a new approach
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-09-01 DOI: 10.1504/IJDMB.2015.072092
C. Gunavathi, K. Premalatha
The Cuckoo Search (CS) optimisation algorithm is used for feature selection in cancer classification using microarray gene expression data. Since gene expression data contain thousands of genes and only a small number of samples, feature selection methods can be used to select informative genes and improve classification accuracy. Initially, the genes are ranked by their T-statistic, Signal-to-Noise Ratio (SNR) and F-statistic values. CS is then used to find the informative genes among the top-m ranked genes. The classification accuracy of the k-Nearest Neighbour (kNN) technique is used as the fitness function for CS. The proposed method is tested and analysed on ten different cancer gene expression datasets. The results show that CS gives 100% average accuracy for the DLBCL Harvard, Lung Michigan, Ovarian Cancer, AML-ALL and Lung Harvard2 datasets, and that it outperforms existing techniques on the DLBCL outcome and prostate datasets.
Volume (as listed): 13 3 1; Pages: 248-65
Citations: 35
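The ranking step described in the abstract above can be sketched for one of the three statistics. The code below assumes the commonly used two-class definition SNR = |mean_0 - mean_1| / (sd_0 + sd_1); the abstract does not state the exact formula, so this definition, the function names and the toy data are assumptions. The Cuckoo Search over the top-m genes and the kNN fitness evaluation are not shown.

```python
from statistics import mean, stdev


def snr_scores(expr: list[list[float]], labels: list[int]) -> list[float]:
    """Signal-to-noise ratio per gene.

    expr: one row per sample, one column per gene.
    labels: class label (0 or 1) per sample; each class needs >= 2 samples.
    """
    scores = []
    for g in range(len(expr[0])):
        class0 = [row[g] for row, y in zip(expr, labels) if y == 0]
        class1 = [row[g] for row, y in zip(expr, labels) if y == 1]
        denom = stdev(class0) + stdev(class1)
        # Assumed definition: absolute mean difference over summed std devs.
        scores.append(abs(mean(class0) - mean(class1)) / denom if denom > 0 else 0.0)
    return scores


def top_m_genes(expr: list[list[float]], labels: list[int], m: int) -> list[int]:
    """Indices of the m genes with the highest SNR, best first."""
    scores = snr_scores(expr, labels)
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:m]


if __name__ == "__main__":
    expr = [
        [1.0, 5.0, 2.0],
        [1.2, 3.0, 2.1],
        [0.9, 4.0, 1.9],
        [3.0, 4.5, 2.0],
        [3.2, 3.5, 2.2],
        [2.9, 4.2, 1.8],
    ]
    labels = [0, 0, 0, 1, 1, 1]
    print(top_m_genes(expr, labels, 2))
```

In the toy data only the first gene separates the two classes, so it ranks first; a search procedure such as CS would then explore subsets of the top-ranked genes.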
PMCR-Miner: parallel maximal confident association rules miner algorithm for microarray data set
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-09-01 DOI: 10.1504/IJDMB.2015.072091
Wael Zakaria Abd Allah, Y. Kotb, F. Ghaleb
The MCR-Miner algorithm aims to mine all maximal high-confidence association rules from a microarray data set of up/down-expressed genes. This paper introduces two new algorithms: IMCR-Miner and PMCR-Miner. The IMCR-Miner algorithm is an extension of the MCR-Miner algorithm with some improvements. These improvements implement a novel way of storing the samples of each gene in a list of unsigned integers in order to benefit from bitwise operations. In addition, the IMCR-Miner algorithm overcomes the drawbacks of the MCR-Miner algorithm by setting some restrictions that avoid repeated comparisons. The PMCR-Miner algorithm is a parallel version of the newly proposed IMCR-Miner algorithm. The PMCR-Miner algorithm is based on shared-memory systems and task parallelism, where no time is needed for sharing and combining data between processors. Experimental results on real microarray data sets show that the PMCR-Miner algorithm is more efficient and scalable than its counterparts.
Volume (as listed): 13 3 1; Pages: 225-47
Citations: 1
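The idea of storing each gene's samples as integers so that rule statistics reduce to bitwise operations can be illustrated as follows. This is a minimal sketch, not the IMCR-Miner data structure: the paper uses a list of unsigned integers, whereas here a single arbitrary-precision Python integer plays that role, and the names `pack_samples`, `support` and `confidence` are hypothetical.

```python
def pack_samples(flags: list[bool]) -> int:
    """Pack one flag per sample (e.g. 'gene is up-expressed') into a bitmask."""
    mask = 0
    for i, flag in enumerate(flags):
        if flag:
            mask |= 1 << i  # bit i represents sample i
    return mask


def support(mask: int) -> int:
    """Number of samples whose bit is set."""
    return bin(mask).count("1")


def confidence(antecedent: int, consequent: int) -> float:
    """Confidence of the rule antecedent -> consequent from two bitmasks."""
    covered = support(antecedent)
    if covered == 0:
        return 0.0
    # A single bitwise AND finds the samples satisfying both sides.
    return support(antecedent & consequent) / covered


if __name__ == "__main__":
    gene_a_up = pack_samples([True, True, False, True])
    gene_b_up = pack_samples([True, False, False, True])
    print(confidence(gene_a_up, gene_b_up), confidence(gene_b_up, gene_a_up))
```

With this encoding, comparing two genes across all samples costs one AND plus a population count instead of a per-sample loop.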
Towards rule-based metabolic databases: a requirement analysis based on KEGG
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-09-01 DOI: 10.1504/IJDMB.2015.072103
S. Richter, I. Fetzer, M. Thullner, F. Centler, P. Dittrich
Knowledge of metabolic processes is collected in easily accessible online databases that are growing rapidly in content and detail. Using these databases for the automatic construction of metabolic network models requires high accuracy and consistency. In this two-part study we evaluate current accuracy and consistency problems using the KEGG database as a prominent example and propose design principles for dealing with such problems. In the first part, we present our computational approach for classifying inconsistencies and provide an overview of the classes of inconsistencies we identified. We detected inconsistencies both in database entries referring to substances and in entries referring to reactions. In the second part, we present strategies for dealing with the detected problem classes. In particular, we propose a rule-based database approach that allows for the inclusion of parameterised molecular species and parameterised reactions. Detailed case studies and a comparison of explicit networks from KEGG with their anticipated rule-based representation underline the applicability and scalability of this approach.
Volume (as listed): 13 3 1; Pages: 289-319
Citations: 2
Orthogonal projection correction for confounders in biological data classification
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-08-01 DOI: 10.1504/IJDMB.2015.071553
Limin Li, Shuqin Zhang
The existence of confounders such as population structure in genome-wide association studies makes it difficult to apply machine learning methods directly to biological problems. It is still unclear how to correct for confounders effectively. In this work, we propose an Orthogonal Projection Correction (OPC) method to correct for confounders. This is achieved by orthogonally decomposing each feature into a confounding component and a non-confounding component, such that the original data can be best reconstructed from only the non-confounding components of the features. The confounder space is built from prior knowledge, and each feature is projected onto its orthogonal complement. This OPC procedure is shown to be kernelisable. We then propose a ProSVM method that integrates the OPC method with a support vector machine for classification. In the experiments, our OPC method for confounder correction improves tumour diagnosis based on samples from different labs and phenotype prediction in the presence of population structure.
Volume (as listed): 13 2 1; Pages: 181-96
Citations: 2
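The projection step described in the abstract above (removing from each feature the part that lies in the confounder space) can be sketched in plain Python. This only illustrates projecting a feature onto the orthogonal complement of a given confounder space; how the paper builds that space and its kernelised form are not reproduced. The function names and the two-lab example are assumptions.

```python
def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors: list[list[float]], tol: float = 1e-12) -> list[list[float]]:
    """Gram-Schmidt orthonormalisation; linearly dependent vectors are dropped."""
    basis: list[list[float]] = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def remove_confounding(feature: list[float], confounders: list[list[float]]) -> list[float]:
    """Project a feature (one value per sample) onto the orthogonal
    complement of the space spanned by the confounder vectors."""
    residual = list(feature)
    for q in orthonormal_basis(confounders):
        c = dot(residual, q)
        residual = [r - c * qi for r, qi in zip(residual, q)]
    return residual


if __name__ == "__main__":
    # Hypothetical confounders: an intercept and an indicator for "lab A".
    confounders = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
    feature = [3.0, 5.0, 10.0, 12.0]  # lab A samples first, then lab B
    print(remove_confounding(feature, confounders))
```

In this example the confounder space contains the per-lab means, so the corrected feature keeps only within-lab variation.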
miRNA target recognition using features of suboptimal alignments
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-08-01 DOI: 10.1504/IJDMB.2015.071523
Ali Katanforoush, Ehsan Mahdavi
MicroRNAs (miRNAs) are a class of short RNA molecules that regulate gene expression by binding directly to messenger RNAs. Conventional approaches to miRNA target prediction estimate the accessibility of target sites and the strength of miRNA binding by finding optima of energy models, which involves O(n^3) computations. Alternatively, we narrow down the potential binding sites of miRNAs to suboptimal hits of a pairwise alignment algorithm called Fitting Alignment, which runs in O(n^2). We invoke the same algorithm once for all candidate sites to measure site accessibility. These features are fed to a binary classifier that is trained to predict true associations between miRNAs and target genes. Training the classifier requires negative samples that indicate non-affected genes. Experiments verifying such negative associations have rarely been performed, so we exploit tissue-specific gene expression data to impute the negative associations. The recall rate of our method is above 70% (at 85% precision).
Volume (as listed): 13 2 1; Pages: 171-80
Citations: 1
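Fitting alignment, mentioned in the abstract above, aligns all of a short sequence against any substring of a longer one with a quadratic dynamic programme. The sketch below shows the textbook formulation with simple match/mismatch/gap scores; the scoring values, the function names, and the use of plain identity scoring (rather than base-pair complementarity, as miRNA binding would need) are assumptions, not details from the paper.

```python
def fitting_last_row(short: str, long_: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1) -> list[int]:
    """Last DP row: entry j is the best score of aligning all of `short`
    to some substring of `long_` that ends just before position j."""
    m = len(long_)
    # Row 0 is all zeros, so the alignment may start anywhere in long_.
    prev = [0] * (m + 1)
    for i in range(1, len(short) + 1):
        curr = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if short[i - 1] == long_[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


def fitting_score(short: str, long_: str) -> int:
    """Optimal fitting-alignment score (the alignment may end anywhere)."""
    return max(fitting_last_row(short, long_))


def suboptimal_hits(short: str, long_: str, min_score: int) -> list[tuple[int, int]]:
    """End positions in long_ whose score reaches min_score, with the score."""
    row = fitting_last_row(short, long_)
    return [(j, s) for j, s in enumerate(row) if s >= min_score]


if __name__ == "__main__":
    print(fitting_score("ACGT", "TTACGTTT"))
    print(suboptimal_hits("ACGT", "TTACGTTT", 4))
```

Lowering `min_score` below the optimum returns additional end positions, which is one simple way to enumerate suboptimal candidate sites.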
Exploiting multi-layered vector spaces for signal peptide detection
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-08-01 DOI: 10.1504/IJDMB.2015.071544
T. Johnsten, Laura Fain, Leanna Fain, Ryan G. Benton, Ethan Butler, L. Pannell, Ming Tan
Analysing and classifying sequences based on similarities and differences is a mathematical problem of escalating relevance and importance in many scientific disciplines. One of the primary challenges in applying machine learning algorithms to sequential data, such as biological sequences, is the extraction and representation of significant features from the data. To address this problem, we have recently developed a representation, entitled Multi-Layered Vector Spaces (MLVS), which is a simple mathematical model that maps sequences into a set of MLVS. We demonstrate the usefulness of the model by applying it to the problem of identifying signal peptides. MLVS feature vectors are generated from a collection of protein sequences and the resulting vectors are used to create support vector machine classifiers. Experiments show that the MLVS-based classifiers are able to outperform or perform on par with several existing methods that are specifically designed for the purpose of identifying signal peptides.
Volume (as listed): 13 2 1; Pages: 141-57
Citations: 3
An effective hybrid approach of gene selection and classification for microarray data based on clustering and particle swarm optimisation
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-08-01 DOI: 10.1504/IJDMB.2015.071515
Fei Han, Shanxiu Yang, Jian Guan
In this paper, a hybrid approach based on clustering and Particle Swarm Optimisation (PSO) is proposed to perform gene selection and classification for microarray data. In the new method, genes are first partitioned into a predetermined number of clusters by the K-means method. Since the genes in each cluster are highly redundant, the Max-Relevance Min-Redundancy (mRMR) strategy is used to reduce the redundancy of the clustered genes. PSO is then used to perform further gene selection from the remaining clustered genes. Because of its better generalisation performance and much faster convergence than other learning algorithms for neural networks, the Extreme Learning Machine (ELM) is chosen to evaluate the candidate gene subsets selected by PSO and to perform sample classification in this study. The proposed method selects fewer redundant genes and increases prediction accuracy; its efficiency and effectiveness are verified by extensive comparisons with other classical methods on three public microarray datasets.
Volume (as listed): 33 1; Pages: 103-21
Citations: 5
A graph-based integrative method of detecting consistent protein functional modules from multiple data sources
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-08-01 DOI: 10.1504/IJDMB.2015.071534
Yuan Zhang, Yue Cheng, Liang Ge, Nan Du, Ke-bin Jia, A. Zhang
Many clustering methods have been developed to identify functional modules in Protein-Protein Interaction (PPI) networks, but the results are far from satisfactory. To overcome the noise and incompleteness of PPI networks and find more accurate and stable functional modules, we propose an integrative method, the bipartite graph-based Non-negative Matrix Factorisation method (BiNMF), in which we adopt multiple biological data sources as different views that describe PPIs. Specifically, traditional clustering models are used for a preliminary analysis of the different views of protein functional similarity. The intermediate clustering results are then represented by a bipartite graph that comprehensively captures the relationships between proteins and intermediate clusters, and finally overlapping clustering results are obtained. Extensive experiments show that our method is superior to baseline methods, and detailed analysis demonstrates the benefits of integrating diverse clustering methods and multiple biological information sources.
目前已有许多聚类方法用于识别蛋白质-蛋白质相互作用(PPI)网络中的功能模块,但结果并不令人满意。为了克服PPI网络的噪声和不完整问题,找到更准确和稳定的功能模块,我们提出了一种综合方法,基于二部图的非负矩阵分解方法(BiNMF),其中我们采用多个生物数据源作为描述PPI的不同视图。具体而言,采用传统聚类模型对蛋白质功能相似性的不同观点进行初步分析。然后将中间聚类结果用一个能全面表示蛋白质与中间聚类关系的二部图表示,最终得到重叠聚类结果。通过大量的实验,我们发现我们的方法优于基线方法,详细的分析已经证明了整合多种聚类方法和多种生物信息源的好处。
{"title":"A graph-based integrative method of detecting consistent protein functional modules from multiple data sources","authors":"Yuan Zhang, Yue Cheng, Liang Ge, Nan Du, Ke-bin Jia, A. Zhang","doi":"10.1504/IJDMB.2015.071534","DOIUrl":"https://doi.org/10.1504/IJDMB.2015.071534","url":null,"abstract":"Many clustering methods have been developed to identify functional modules in Protein-Protein Interaction (PPI) networks but the results are far from satisfaction. To overcome the noise and incomplete problems of PPI networks and find more accurate and stable functional modules, we propose an integrative method, bipartite graph-based Non-negative Matrix Factorisation method (BiNMF), in which we adopt multiple biological data sources as different views that describe PPIs. Specifically, traditional clustering models are adopted as preliminary analysis of different views of protein functional similarity. Then the intermediate clustering results are represented by a bipartite graph which can comprehensively represent the relationships between proteins and intermediate clusters and finally overlapping clustering results are achieved. 
Through extensive experiments, we see that our method is superior to baseline methods and detailed analysis has demonstrated the benefits of integrating diverse clustering methods and multiple biological information sources.","PeriodicalId":54964,"journal":{"name":"International Journal of Data Mining and Bioinformatics","volume":"13 2 1","pages":"122-40"},"PeriodicalIF":0.3,"publicationDate":"2015-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1504/IJDMB.2015.071534","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"66730463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Wavelet-based gene selection method for survival prediction in diffuse large B-cell lymphomas patients
IF 0.3 Zone 4 Biology Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY Pub Date: 2015-08-01 DOI: 10.1504/IJDMB.2015.071556
M. Farhadian, H. Mahjub, A. Moghimbeigi, P. Lisboa, J. Poorolajal, Muharram Mansoorizadeh
Microarray technology allows simultaneous measurement of expression levels for thousands of genes. An important aspect of microarray studies is the prediction of patient survival from gene expression profiles. This naturally calls for a dimension reduction procedure used together with the survival prediction model. This study presents a new method, based on the wavelet transform, for selecting survival-relevant genes. The Cox proportional hazards model, which is typically used for this purpose, builds the survival prediction model from the selected genes. The prediction model is evaluated with R², the concordance index, the likelihood ratio statistic and the Akaike information criterion. The results show that the selected genes yield good survival prediction performance. They also suggest that more advanced wavelet-based tools could be developed for gene selection from microarray data sets in the context of survival analysis.
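Of the evaluation measures named in the abstract, the concordance index is the easiest to illustrate in a few lines. The following is a minimal sketch of Harrell's concordance index, not the authors' implementation. The function name `concordance_index`, its arguments, and the toy data are assumptions made for illustration. Pairs with tied survival times are simply skipped here, which is a simplification.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C for right-censored data (illustrative sketch).

    times       -- observed follow-up times
    events      -- 1 if the event (death) was observed, 0 if censored
    risk_scores -- model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            # Pair is comparable when subject i had the event strictly earlier.
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): three patients, the last one censored.
toy_times = [5, 10, 15]
toy_events = [1, 1, 0]
toy_risk = [3.0, 2.0, 1.0]
c_index = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 1.0 means the risk ordering agrees with every comparable pair, 0.5 corresponds to uninformative scores, and 0.0 to a fully reversed ordering.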
Citations: 1