International Journal of Data Mining and Bioinformatics: Latest Articles


A novel random forests-based feature selection method for microarray expression data analysis

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-07-01 DOI: 10.1504/IJDMB.2015.070852

Dengju Yao, Jing Yang, Xiaojuan Zhan, Xiaorong Zhan, Zhiqiang Xie

High-dimensional data and the large number of redundant features in bioinformatics research have created an urgent need for feature selection. In this paper, a novel random forests-based feature selection method is proposed that stratifies the feature space and combines generalised sequential backward search and generalised sequential forward search strategies. A random forest variable importance score is used to rank features, and different classifiers are used as the feature subset evaluation function. The proposed method is examined on five microarray expression datasets (leukaemia, prostate, breast, nervous and DLBCL), and the average accuracies of the SVM classifier on these datasets are 100%, 95.24%, 85%, 91.67%, and 91.67%, respectively. The results show that the proposed method not only improves classification accuracy but also greatly reduces the computation time of the feature selection process.
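The backward half of such a search can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature names, the `step` size, and the `toy_score` function are hypothetical. In the setting the abstract describes, the ranking would come from random forest variable importance scores and the evaluation function would be a classifier's accuracy.

```python
def backward_select(ranked, evaluate, step=2):
    """Generalised backward elimination over an importance ranking.

    ranked   -- feature names, most important first (assumed precomputed)
    evaluate -- callable scoring a feature subset (higher is better)
    step     -- number of lowest-ranked features dropped per round
    """
    selected = list(ranked)
    best = evaluate(selected)
    while len(selected) > step:
        candidate = selected[:-step]      # drop the `step` weakest features
        score = evaluate(candidate)
        if score < best:                  # stop once the score gets worse
            break
        selected, best = candidate, score
    return selected, best


# Hypothetical stand-in for classifier accuracy: reward informative genes
# and charge a small penalty for every retained feature.
INFORMATIVE = {"g1", "g2", "g3"}


def toy_score(subset):
    return 100 * len(INFORMATIVE & set(subset)) - len(subset)


ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
subset, score = backward_select(ranking, toy_score)
print(subset, score)
```

Dropping several features per round (rather than one) is what makes the search "generalised"; it trades some granularity for fewer evaluations of the subset scoring function.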

Citations: 21

Assessing protein-protein interactions based on the semantic similarity of interacting proteins

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-07-01 DOI: 10.1504/IJDMB.2015.070842

Guangyu Cui, Byungmin Kim, Saud Alguwaizani, Kyungsook Han

The Gene Ontology (GO) has been used to estimate the semantic similarity of proteins because it provides the largest and most reliable vocabulary of gene products and their characteristics. We developed a new method that assesses Protein-Protein Interactions (PPI) using the branching factor and information content of the common ancestor of interacting proteins in the GO hierarchy. We compared the measure with other GO-based similarity measures, and the evaluation results showed that our method outperformed the others in most GO domains.
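To make "information content of the common ancestor" concrete, here is a minimal sketch of the classic Resnik-style measure on a toy ontology. It is not the method proposed in the paper (the branching-factor term is not reproduced, since the abstract does not give its formula), and the term names and annotation counts are hypothetical.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (a DAG, like GO).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
}

# Hypothetical number of gene products annotated directly to each term.
DIRECT_COUNTS = {
    "root": 0,
    "binding": 2,
    "catalysis": 4,
    "dna_binding": 1,
    "rna_binding": 1,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    found = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def information_content(term):
    """IC(t) = -log p(t); p(t) counts annotations to t or any descendant."""
    total = sum(DIRECT_COUNTS.values())
    covered = sum(c for t, c in DIRECT_COUNTS.items() if term in ancestors(t))
    return -math.log(covered / total)


def resnik_similarity(term_a, term_b):
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)


print(resnik_similarity("dna_binding", "rna_binding"))
print(resnik_similarity("dna_binding", "catalysis"))
```

Two terms that share only the root score zero, because the root covers every annotation and therefore carries no information; terms that meet at a rarer ancestor score higher.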

Citations: 6

TrieAMD: a scalable and efficient apriori motif discovery approach

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-07-01 DOI: 10.1504/IJDMB.2015.070833

Isra M. Al-Turaiki, G. Badr, H. Mathkour

Motif discovery is the problem of finding recurring patterns in biological sequences. It is one of the hardest and longest-standing problems in bioinformatics. Apriori is a well-known data-mining algorithm for discovering frequent patterns in large datasets. In this paper, we apply the Apriori algorithm and use the Trie data structure to discover motifs. We propose several modifications to adapt the classic Apriori to our problem. Experiments are conducted on Tompa's benchmark to investigate the performance of our proposed algorithm, the Trie-based Apriori Motif Discovery (TrieAMD). Results show that our algorithm outperforms all of the tested tools on real datasets for the average sensitivity measure, which means that our approach is able to discover more motifs. In terms of specificity, the performance of our algorithm is comparable to the other tools. The results also confirm both linear time and linear space scalability of the algorithm.
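The combination of Apriori-style level-wise search with a trie can be sketched as below. This is a simplified illustration and not TrieAMD itself: it handles exact substrings only (real motif discovery tolerates mismatches), and the function names, the support definition (number of sequences containing the pattern), and the example sequences are assumptions.

```python
class TrieNode:
    """One node per frequent pattern; `support` = sequences containing it."""

    def __init__(self):
        self.children = {}
        self.support = 0


def find(root, pattern):
    """Return the trie node for `pattern`, or None if it is not stored."""
    node = root
    for symbol in pattern:
        node = node.children.get(symbol)
        if node is None:
            return None
    return node


def apriori_trie(sequences, min_support, max_len):
    """Level-wise search for substrings present in >= min_support sequences."""
    root = TrieNode()
    frequent = {}
    for length in range(1, max_len + 1):
        counts = {}
        for seq in sequences:
            seen = set()
            for start in range(len(seq) - length + 1):
                cand = seq[start:start + length]
                # Apriori pruning: a pattern can only be frequent if both of
                # its (length - 1)-substrings are already in the trie.
                if length > 1 and (find(root, cand[:-1]) is None
                                   or find(root, cand[1:]) is None):
                    continue
                seen.add(cand)
            for cand in seen:             # count each sequence at most once
                counts[cand] = counts.get(cand, 0) + 1
        level = {p: c for p, c in counts.items() if c >= min_support}
        if not level:
            break
        for pattern, count in level.items():
            parent = find(root, pattern[:-1])   # exists thanks to pruning
            node = parent.children.setdefault(pattern[-1], TrieNode())
            node.support = count
        frequent.update(level)
    return frequent


motifs = apriori_trie(["ACGTAC", "TACGA", "GGACG"], min_support=3, max_len=3)
print(motifs)
```

The trie keeps every frequent pattern reachable by walking its symbols, so the pruning check for a candidate of length k costs O(k) and shared prefixes are stored once.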

Citations: 4

Mitigating bias in planning two-colour microarray experiments

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-07-01 DOI: 10.1504/IJDMB.2015.070838

Nilgun Ferhatosmanoglu, T. Allen, Ümit V. Çatalyürek

Two-colour microarrays are used to study differential gene expression on a large scale. Experimental planning can help reduce the chances of wrong inferences about whether genes are differentially expressed. Previous research on this problem has focused on minimising estimation errors (according to variance-based criteria such as A-optimality) on the basis of optimistic assumptions about the system studied. In this paper, we propose a novel planning criterion to evaluate existing plans for microarray experiments. The proposed criterion is 'Generalised-A Optimality' that is based on realistic assumptions that include bias errors. Using Generalised-A Optimality, the reference-design approach is likely to yield greater estimation accuracy in specific situations in which loop designs had previously seemed superior. However, hybrid designs are likely to offer higher estimation accuracy than reference, loop and interwoven designs having the same number of samples and slides. These findings are supported by data from both simulated and real microarray experiments.

Citations: 2

An integrated strategy for functional analysis of microbial communities based on gene ontology and 16S rRNA gene

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-07-01 DOI: 10.1504/IJDMB.2015.070841

Suping Deng, De-shuang Huang

We analyse the similarity among microbial communities at the functional level after assigning the 16S rRNA sequences from all microbial communities to species. This analysis is an important addition to the species-level relationship between two compared communities and can quantify their differences in function. We downloaded all functional annotation data of several microbiotas. The strategy was developed to identify the functional distribution and the significantly enriched functional categories of microbial communities. We analysed the similarity between two microbial communities in terms of functional state. The experimental results show that semantic similarity can quantify the difference between two compared species at the functional level, and that the function of microbial communities can be analysed by gene ontology based on the 16S rRNA gene. Exploring the functional relationship between two sets of species assemblages will be a key result of microbiome studies and may provide new insights into the assembly of a wide range of ecosystems.

Citations: 4

Gene function prediction with knowledge from gene ontology

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-07-01 DOI: 10.1504/IJDMB.2015.070840

Ying Shen, Lin Zhang

Gene function prediction is an important problem in bioinformatics. Because of the inherent noise in gene expression data, the improvement in prediction accuracy that new classification techniques alone can deliver is limited. With the emergence of the Gene Ontology (GO), extra knowledge about gene products can be extracted from GO to help solve the gene function prediction problem. In this paper, we propose a new method that utilises GO information to improve classifier performance in gene function prediction. Specifically, our method learns a distance metric under the supervision of GO knowledge using a distance learning technique. Compared with traditional distance metrics, the learned one performs better and consequently improves classification accuracy. The effectiveness of our proposed method is corroborated by extensive experimental results.

Citations: 1

DNA sequence and structure properties analysis reveals similarities and differences to promoters of stress responsive genes in Arabidopsis thaliana

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-07-01 DOI: 10.1504/IJDMB.2015.070832

P. Zhu, Yanhong Zhou, Libin Zhang, Chuang Ma

Understanding the regulatory mechanisms of stress response in plants has important biological and agricultural significance. In this study, we first compiled a set of genes responsive to different stresses in Arabidopsis thaliana and then comparatively analysed their promoters at both the DNA sequence and three-dimensional structure levels. Notably, the comparison revealed that the profiles of several sequence and structure properties vary distinctly in different regions of promoters. Moreover, the content of nucleotide T and the profile of B-DNA twist are distinct in promoters from different stress groups, suggesting that Arabidopsis genes might exploit different regulatory mechanisms in response to various stresses. Finally, we evaluated the performance of two representative promoter predictors, EP3 and PromPred. The evaluation results revealed their strengths and weaknesses for identifying stress-related promoters, providing valuable guidelines to accelerate the discovery of novel stress-related promoters and genes in plants.

Citations: 0

Ensemble of sparse classifiers for high-dimensional biological data.

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-01-01 DOI: 10.1504/ijdmb.2015.069416

Sunghan Kim, Fabien Scalzo, Donatello Telesca, Xiao Hu

Biological data are often high in dimension while the number of samples is small. In such cases, classification performance can be improved by reducing the dimension of the data, which is referred to as feature selection. Recently, a novel feature selection method has been proposed that utilises the sparsity of high-dimensional biological data, where a small subset of features accounts for most of the variance of the dataset. In this study we propose a new classification method for high-dimensional biological data, which performs both feature selection and classification within a single framework. Our proposed method utilises a sparse linear solution technique and the bootstrap aggregating algorithm. We tested its performance on four public mass spectrometry cancer datasets alongside two conventional classification techniques, Support Vector Machines and Adaptive Boosting. The results demonstrate that our proposed method classifies more accurately across various cancer datasets than those conventional classification techniques.

Citations: 5

A mixture of physicochemical and evolutionary-based feature extraction approaches for protein fold recognition.

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-01-01 DOI: 10.1504/ijdmb.2015.066359

Abdollah Dehzangi, Alok Sharma, James Lyons, Kuldip K Paliwal, Abdul Sattar

Recent advances in the pattern recognition field have stimulated enormous interest in Protein Fold Recognition (PFR). PFR is considered a crucial step towards protein structure prediction and drug design. Despite recent achievements, PFR remains an unsolved problem in biological science and its prediction accuracy remains unsatisfactory. Furthermore, the impact of using a wide range of physicochemical-based attributes on PFR has not been adequately explored. In this study, we propose a novel mixture of physicochemical and evolutionary-based feature extraction methods based on the concepts of segmented distribution and density. We also explore the impact of 55 different physicochemical-based attributes on PFR. Our results show that by providing more local discriminatory information and by benefiting from physicochemical and evolutionary-based features simultaneously, we can enhance protein fold prediction accuracy by up to 5% over previously reported results in the literature.

Citations: 19

Concepts of relative sample outlier (RSO) and weighted sample similarity (WSS) for improving performance of clustering genes: co-function and co-regulation.

IF 0.3 Zone 4 (Biology) Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

International Journal of Data Mining and Bioinformatics

Pub Date : 2015-01-01 DOI: 10.1504/ijdmb.2015.067322

Anindya Bhattacharya, Nirmalya Chowdhury, Rajat K De

The performance of clustering algorithms depends largely on the selected similarity measure. Efficiency in handling outliers is a major contributor to the success of a similarity measure. The better a similarity measure captures the similarity between genes in the presence of outliers, the better the clustering algorithm performs in forming biologically relevant groups of genes. In the present article, we discuss the problem of handling outliers with different existing similarity measures and introduce the concept of the Relative Sample Outlier (RSO). We formulate a new similarity, called Weighted Sample Similarity (WSS), incorporate it into Euclidean distance and the Pearson correlation coefficient, and then use them in various clustering and biclustering algorithms to group different gene expression profiles. Our results suggest that WSS improves the performance of all the considered clustering algorithms in terms of finding biologically relevant groups of genes.
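The effect of down-weighting an outlying sample can be illustrated with a generic weighted Pearson correlation. The abstract does not give the WSS formula, so this sketch is only a standard weighted correlation under assumed per-sample weights; the expression values and weights are hypothetical.

```python
import math


def weighted_pearson(x, y, weights):
    """Pearson correlation in which sample i contributes with weights[i]."""
    total = sum(weights)
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    cov = sum(w * (a - mean_x) * (b - mean_y)
              for w, a, b in zip(weights, x, y))
    var_x = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, x))
    var_y = sum(w * (b - mean_y) ** 2 for w, b in zip(weights, y))
    return cov / math.sqrt(var_x * var_y)


# Two hypothetical expression profiles that agree except in the last sample.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 6.0, 8.0, -10.0]

uniform = weighted_pearson(gene_a, gene_b, [1, 1, 1, 1, 1])
downweighted = weighted_pearson(gene_a, gene_b, [1, 1, 1, 1, 0])
print(uniform, downweighted)
```

With uniform weights the single outlying sample makes the two profiles look anti-correlated; giving that sample zero weight recovers the perfect agreement of the remaining four.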

Citations: 0
