
Latest publications from the Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

Bayesian Collective Markov Random Fields for Subcellular Localization Prediction of Human Proteins
Lu Zhu, M. Ester
Advanced biotechnology makes it possible to access a multitude of heterogeneous proteomic, interactomic, genomic, and functional annotation data. One challenge in computational biology is to integrate these data to enable automated prediction of the Subcellular Localizations (SCLs) of human proteins. For proteins that have multiple biological roles, their correct in silico assignment to different SCLs can be considered an imbalanced multi-label classification problem. In this study, we developed a Bayesian Collective Markov Random Fields (BCMRFs) model for multi-SCL prediction of human proteins. Given a set of unknown proteins and their corresponding protein-protein interaction (PPI) network, the SCLs of each protein can be inferred from the SCLs of its interacting partners. To do so, we integrate PPIs, the adjacency of SCLs, and protein features, and perform transductive learning on the re-balanced dataset. Our experimental results show that the spatial adjacency of the SCLs improves multi-SCL prediction, especially for SCLs with few annotated instances. Our approach outperforms the state-of-the-art PPI-based and feature-based multi-SCL prediction methods for human proteins.
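To illustrate the collective idea of inferring a protein's localizations from its interaction partners, the sketch below lets an unlabeled protein adopt the SCLs most common among its labeled PPI neighbors. This is purely a toy neighborhood-vote stand-in, not the authors' BCMRF model; the function names and voting rule are hypothetical, and the real model additionally uses SCL adjacency and protein features.

```python
from collections import Counter

def infer_scl(ppi, known, rounds=3):
    # Each unlabeled protein adopts the SCL(s) that receive the most
    # votes from its interaction partners' current labels.  (Toy rule;
    # the BCMRF model also uses SCL adjacency and protein features.)
    labels = {p: set(s) for p, s in known.items()}
    for _ in range(rounds):
        for prot, partners in ppi.items():
            if prot in labels:
                continue
            votes = Counter(scl for p in partners
                            for scl in labels.get(p, ()))
            if votes:
                top = max(votes.values())
                labels[prot] = {s for s, n in votes.items() if n == top}
    return labels

# Tiny PPI network: P1 is unannotated, its partners are annotated.
ppi = {"P1": ["P2", "P3"], "P2": ["P1"], "P3": ["P1"]}
known = {"P2": {"nucleus"}, "P3": {"nucleus", "cytosol"}}
pred = infer_scl(ppi, known)   # P1 gets {"nucleus"} (2 votes vs. 1)
```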
DOI: 10.1145/3107411.3107412 · Published: 2017-08-20 · Citations: 0
GPU-PCC: A GPU Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Big fMRI Data
Taban Eslami, M. Awan, F. Saeed
Functional Magnetic Resonance Imaging (fMRI) is a non-invasive brain imaging technique for studying the brain's functional activities. Pearson's Correlation Coefficient is an important measure for capturing dynamic behaviors and functional connectivity between brain components. One bottleneck in computing Correlation Coefficients is the time it takes to process big fMRI data. In this paper, we propose GPU-PCC, a GPU-based algorithm built on the vector dot product, which computes pairwise Pearson's Correlation Coefficients while performing the computation only once for each pair. Our method computes Correlation Coefficients in an ordered fashion without the need for post-processing reordering of coefficients. We evaluated GPU-PCC using synthetic and real fMRI data and compared it with a sequential CPU implementation and an existing state-of-the-art GPU method. We show that GPU-PCC runs 94.62x faster than the CPU version and 4.28x faster than the existing GPU-based technique on a real fMRI dataset of 90k voxels. The implemented code is available under the GPL license on our lab's GitHub portal at https://github.com/pcdslab/GPU-PCC.
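The dot-product formulation can be sketched on the CPU with NumPy (a toy analogue, not the authors' CUDA kernel): after z-scoring each time series, every pairwise Pearson coefficient is a single dot product, so the full correlation matrix is one matrix product.

```python
import numpy as np

def pairwise_pcc(X):
    # Z-score the rows, then one matrix product yields all pairwise
    # Pearson correlation coefficients, each pair computed once.
    Z = X - X.mean(axis=1, keepdims=True)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    return Z @ Z.T

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))   # 5 toy voxel time series
R = pairwise_pcc(X)             # R[i, j] is the Pearson r of series i, j
```

On the GPU the same computation maps naturally onto a batched dot-product kernel, which is the structure the paper exploits.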
DOI: 10.1145/3107411.3108173 · Published: 2017-08-20 · Citations: 12
Deep Residual Nets for Improved Alzheimer's Diagnosis
Aly A. Valliani, Ameet Soni
We propose a framework that leverages deep residual CNNs pretrained on large, non-biomedical image data sets. These pretrained networks learn cross-domain features that improve low-level interpretation of images. We evaluate our model on brain imaging data and show that pretraining and the use of deep residual networks are crucial to seeing large improvements in Alzheimer's Disease diagnosis from brain MRIs.
DOI: 10.1145/3107411.3108224 · Published: 2017-08-20 · Citations: 59
Automated Protein Chain Isolation from 3D Cryo-EM Data and Volume Comparison Tool
Michael Nissenson, Dong Si
In electron cryo-microscopy (cryo-EM), manual isolation of the volumetric protein density map data surrounding known protein structures is a time-consuming process that requires constant expert attention for multiple hours. This paper presents a tool, Volume Cut, and an algorithm that automatically isolates, in just minutes, the volumetric data surrounding individual protein chains from the entire macromolecular complex. This tool can be used in the data collection and pre-processing steps to generate good training datasets of single-chain volume-structure pairs, which can further be used to study protein structure prediction from experimental 3D cryo-EM density maps using data mining and machine learning. Additionally, an application of this tool was explored in depth that compares the cut experimental cryo-EM data with simulated data in an attempt to find irregularities in the experimental data for validation purposes. The source for both tools can be found at https://github.com/nissensonm/VolumeCut/.
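A minimal sketch of the chain-isolation idea on a voxel grid: keep only density within a cutoff radius of a chain's coordinates and zero out the rest. This is purely illustrative; the function name, the radius cutoff, and the spherical masking rule are assumptions, not the Volume Cut algorithm.

```python
import numpy as np

def cut_chain_volume(volume, chain_voxels, radius=2):
    # Keep only density within `radius` voxels of the chain's
    # coordinates; everything else in the map is zeroed out.
    gx, gy, gz = np.indices(volume.shape)
    mask = np.zeros(volume.shape, dtype=bool)
    for x, y, z in chain_voxels:
        mask |= (gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2 <= radius ** 2
    return np.where(mask, volume, 0.0)

density = np.ones((5, 5, 5))        # toy density map
chain = [(2, 2, 2)]                 # one "chain" voxel at the center
isolated = cut_chain_volume(density, chain, radius=1)
```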
DOI: 10.1145/3107411.3107500 · Published: 2017-08-20 · Citations: 0
GOstruct 2.0: Automated Protein Function Prediction for Annotated Proteins
Indika Kahanda, A. Ben-Hur
Automated Protein Function Prediction is the task of automatically predicting functional annotations for a protein based on gold-standard annotations derived from experimental assays. These experiment-based annotations accumulate over time: proteins without annotations get annotated, and new functions of already annotated proteins are discovered. Therefore, function prediction can be considered a combination of two sub-tasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In previous work, we analyzed the performance of several protein function prediction methods in these two scenarios. Our results showed that GOstruct, which is based on the structured output framework, had lower accuracy in the task of predicting annotations for proteins with existing annotations, while its performance on unannotated proteins was similar to its performance in cross-validation. In this work, we present GOstruct 2.0, which includes improvements that allow the model to make use of a protein's current annotations to better handle the task of predicting novel annotations for previously annotated proteins. This is highly important for model organisms, where most proteins have some level of annotation. Experimental results on human data show that GOstruct 2.0 outperforms the original GOstruct in this task, demonstrating the effectiveness of the proposed improvements. This is the first study that focuses on adapting the structured output framework for applications in which labels are incomplete by nature.
DOI: 10.1145/3107411.3107417 · Published: 2017-08-20 · Citations: 9
Bayesian Hyperparameter Optimization for Machine Learning Based eQTL Analysis
Andrew Quitadamo, James Johnson, Xinghua Shi
Machine learning methods are being applied to a wide range of problems in biology and bioinformatics. These methods often rely on configuring high-level parameters, or hyperparameters, such as the regularization hyperparameters in sparse learning models like graph-guided multi-task Lasso. Different choices for these hyperparameters lead to different results, which makes finding good hyperparameter combinations an important task when using these hyperparameter-dependent methods. There are several ways to tune hyperparameters, including manual tuning, grid search, random search, and Bayesian optimization. In this paper, we apply three hyperparameter tuning strategies to eQTL analysis: grid search, random search, and Bayesian optimization. Experiments show that the Bayesian optimization strategy outperforms the other strategies in modeling eQTL associations. Applying this strategy to assess eQTL associations using the 1000 Genomes structural variation genotypes and RNAseq data in gEUVADIS, we identify a set of new SVs associated with gene expression changes in a human population.
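The two baseline strategies, grid and random search, can be sketched on a toy objective. The objective function and hyperparameter names below are hypothetical stand-ins for a cross-validated eQTL model's error, not the paper's setup.

```python
import itertools
import random

def objective(alpha, l1_ratio):
    # Hypothetical stand-in for cross-validated model error; by
    # construction the optimum is at alpha=0.3, l1_ratio=0.7.
    return (alpha - 0.3) ** 2 + (l1_ratio - 0.7) ** 2

# Grid search: exhaustive evaluation over a fixed lattice of values.
grid = list(itertools.product([0.1, 0.3, 0.5], [0.25, 0.5, 0.75]))
best_grid = min(grid, key=lambda p: objective(*p))

# Random search: the same budget of 9 trials, drawn uniformly.
random.seed(0)
trials = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(9)]
best_rand = min(trials, key=lambda p: objective(*p))
```

Bayesian optimization differs from both in that it fits a surrogate model to past evaluations and proposes the next trial where expected improvement is highest, which is why it tends to win under a fixed evaluation budget.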
DOI: 10.1145/3107411.3107434 · Published: 2017-08-20 · Citations: 3
A Multi-view Deep Learning Method for Epileptic Seizure Detection using Short-time Fourier Transform
Ye Yuan, Guangxu Xun, Ke-bin Jia, Aidong Zhang
With the advances in pervasive sensor technologies, physiological signals can be captured continuously to prevent the serious outcomes caused by epilepsy. Detection of epileptic seizure onset from collected multi-channel electroencephalogram (EEG) data has attracted a lot of attention recently. Deep learning is a promising method for analyzing large-scale unlabeled data. In this paper, we propose a multi-view deep learning model that captures brain abnormality from multi-channel epileptic EEG signals for seizure detection. Specifically, we first generate EEG spectrograms using the short-time Fourier transform (STFT) to represent the time-frequency information after signal segmentation. Second, we adopt stacked sparse denoising autoencoders (SSDA) to learn multiple features in an unsupervised fashion by considering both the intra- and inter-correlation of EEG channels, denoted as intra-channel and cross-channel features, respectively. Third, we add an SSDA-based channel selection procedure using a proposed response rate to reduce the dimension of the intra-channel features. Finally, we concatenate the learned multi-features and apply a fully-connected SSDA model with a softmax classifier to jointly learn the cross-patient seizure detector in a supervised fashion. To evaluate the performance of the proposed model, we carry out experiments on a real-world benchmark EEG dataset and compare it with six baselines. Extensive experimental results demonstrate that the proposed learning model is able to extract latent features with meaningful interpretation, and hence is effective in detecting epileptic seizures.
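The first step, turning a signal segment into an STFT magnitude spectrogram, might look like the following NumPy sketch. The window length, hop size, and sampling rate are illustrative choices, not the paper's settings.

```python
import numpy as np

def stft_magnitude(signal, win_len=64, hop=32):
    # Hann-windowed frames, one real FFT per frame: rows are time
    # frames, columns are frequency bins (win_len // 2 + 1 of them).
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 256                               # assumed sampling rate (Hz)
t = np.arange(2 * fs) / fs             # two seconds of toy "EEG"
x = np.sin(2 * np.pi * 12 * t)         # a 12 Hz oscillation
S = stft_magnitude(x)
peak_hz = S.mean(axis=0).argmax() * fs / 64   # dominant frequency
```

Each channel's spectrogram then becomes the input "view" that the stacked sparse denoising autoencoders consume.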
DOI: 10.1145/3107411.3107419 · Published: 2017-08-20 · Citations: 87
Analysis of 16S Genomic Data using Graphical Databases
O. Ahern, Rebecca J. Stevick, Li Yuan, Noah M. Daniels
Since the Human Genome Project was completed in 2003, many data scientists have developed algorithms to store and query high volumes of genomic data. The most common data storage techniques employed in these algorithms are flat files or relational databases. While sophisticated indexing techniques can accelerate queries, an alternative is to store biological sequence data directly in a way that supports efficient queries. Here we introduce a new algorithm that aims to compress redundant information and improve query speed with the help of graphical databases, which have been commercially available since the mid-to-late 2000s. A graphical database stores information using nodes and relationships (edges). Our approach is to identify subsequences that are common among many sequences and to store these as "common nodes" in the graphical database. This is accomplished for sequencing data as follows: split the whole sequence into k-mers; if a given k-mer is common to enough sequences, it is labeled as a common segment; if a k-mer is unique (or common to too few sequences), it is labeled as a single segment. Thus, common nodes and single nodes are formed from common segments and single segments, respectively. These two kinds of nodes are connected by edges in the graphical database, allowing each original sequence to be reconstructed by following edges in the graph. This graphical database model allows for fast taxonomic queries of 16S rDNA. When queried, the database can first attempt to find common nodes that match the query sequence, and subsequently follow edges to single nodes to refine the search. This approach is analogous to that of "compressive genomics", except that the compression is implicit in the graphical database storage model. Beyond simple sequence queries, this graphical database representation also supports variability analysis, which identifies highly variable vs. conserved regions of 16S sequence. Regions of low variability correspond to common nodes, while regions of high variability correspond to a variety of paths through single nodes. The figure illustrates common and single nodes, and a corresponding plot of variability. Benchmarking of sequence search indicates that query time in graphical databases is significantly faster than in flat files or relational databases. Implementation of graphical databases in genomic data analysis will allow for accelerated search, and may lend itself to other forms of efficient analysis, such as tetramer frequency analysis, which is useful in metagenomic binning.
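The node-labeling step described above can be sketched in a few lines: count, for each k-mer, how many sequences contain it, and label it common or single against a sharing threshold. A toy version; the k value, threshold, and function name are illustrative, not the paper's parameters.

```python
from collections import Counter

def classify_kmers(sequences, k=4, min_seqs=2):
    # Count how many sequences contain each k-mer; k-mers shared by at
    # least `min_seqs` sequences become common nodes, the rest single.
    presence = Counter()
    for seq in sequences:
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            presence[kmer] += 1
    common = {km for km, n in presence.items() if n >= min_seqs}
    single = set(presence) - common
    return common, single

seqs = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
common, single = classify_kmers(seqs)   # "ACGT" occurs in all three
```

In the graphical database, each common k-mer becomes one shared node with edges to the per-sequence single nodes, which is what makes the compression implicit in the storage model.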
{"title":"Analysis of 16S Genomic Data using Graphical Databases","authors":"O. Ahern, Rebecca J. Stevick, Li Yuan, Noah M. Daniels","doi":"10.1145/3107411.3108208","DOIUrl":"https://doi.org/10.1145/3107411.3108208","url":null,"abstract":"Since the Human Genome Project was completed in 2003, many data scientists have developed algorithms in order to store and query high volumes of genomic data. The most common data storage techniques employed in these algorithms are flat files or relational databases. While sophisticated indexing techniques can accelerate queries, an alternative is to store biological sequence data directly in a way that supports efficient queries. Here we introduce a new algorithm that aims to compress the redundant information and improve the performance of query speed with the help of graphical databases, which have been commercial available since the mid-late 2000s. A graphical database stores information using nodes and relationships (edges). Our approach is to identify subsequences that are common among many sequences, and to store these as \"common nodes\" in the graphical database. This is accomplished for sequencing data as follows: split the whole sequence into k-mers: if a given k-mer is common to enough sequences, then it is labeled as a common segment; if a k-mer is unique (or common to too few sequences), then it is labeled as a single segment. Thus, common nodes and single nodes are formed from common segments and single segments, respectively. These two kinds of nodes are connected by edges in the graphical database, allowing each original sequences to be reconstructed by following edges in the graph. This graphical database model allows for fast taxonomic queries of 16S rDNA. When queried, the database can first attempt to find common nodes that match the query sequence, and subsequently follow edges to single nodes to refine the search. 
This approach is analogous to that of \"compressive genomics\", except that the compression is implicit in the graphical database storage model. Beyond simple sequence queries, this graphical database representation also supports variability analysis, which identifies highly variable vs. conserved regions of 16S sequence. Regions of low variability correspond to common nodes, while regions of high variability correspond to a variety of paths through single nodes. Figure illustrates common and single nodes, and a corresponding plot of variability. Benchmarking of sequence search indicates that query time in graphical databases is significantly faster than in flat files or relational databases. Implementation of graphical databases in genomic data analysis will allow for accelerated search, and may lend itself to other forms of efficient analysis, such as tetramer frequency analysis, which is useful in metagenomic binning.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132154873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Detection of Differential Abundance Intervals in Longitudinal Metagenomic Data Using Negative Binomial Smoothing Spline ANOVA
Ahmed A. Metwally, P. Finn, Yang Dai, D. Perkins
Metagenomic longitudinal studies have become a widely-used study design to investigate the dynamics of microbial ecological systems and their temporal effects. One of the important questions to be addressed in longitudinal studies is the identification of time intervals when microbial features show changes in their abundance. We propose a statistical method based on a semi-parametric smoothing spline ANOVA and the negative binomial distribution to model the time-course of the features between two phenotypes. Using simulated data, we demonstrate the superior performance of our proposed method compared to two existing methods. We also apply our method to a longitudinal dataset that investigates the association between the development of type 1 diabetes in infants and the gut microbiome. The identified significant species and their specific time intervals reveal new information that can be used in improving intervention or treatment plans.
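As a rough illustration of what "detecting differential abundance intervals" means, the sketch below substitutes a per-timepoint permutation test on group means for the paper's negative binomial smoothing-spline ANOVA; all names and parameters are hypothetical, and the statistical machinery is deliberately simplified.

```python
import numpy as np

def differential_intervals(group_a, group_b, n_perm=1000, alpha=0.05, seed=0):
    """group_a, group_b: (subjects x timepoints) count matrices for one
    microbial feature in two phenotypes. Returns (start, end) index
    intervals where abundances differ, by running a permutation test on
    group means at each timepoint and merging consecutive significant
    timepoints. (A simplified stand-in for NB smoothing-spline ANOVA.)"""
    rng = np.random.default_rng(seed)
    n_a, T = group_a.shape
    pooled = np.vstack([group_a, group_b])
    pvals = np.empty(T)
    for t in range(T):
        obs = abs(group_a[:, t].mean() - group_b[:, t].mean())
        hits = 0
        for _ in range(n_perm):
            perm = rng.permutation(pooled[:, t])
            hits += abs(perm[:n_a].mean() - perm[n_a:].mean()) >= obs
        pvals[t] = (hits + 1) / (n_perm + 1)
    # Merge runs of significant timepoints into intervals.
    intervals, start = [], None
    for t, sig in enumerate(pvals < alpha):
        if sig and start is None:
            start = t
        elif not sig and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, T - 1))
    return intervals
```

The actual method models the whole time-course with spline smoothing under a negative binomial likelihood, which handles overdispersed counts and irregular sampling far better than this pointwise test.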
{"title":"Detection of Differential Abundance Intervals in Longitudinal Metagenomic Data Using Negative Binomial Smoothing Spline ANOVA","authors":"Ahmed A. Metwally, P. Finn, Yang Dai, D. Perkins","doi":"10.1145/3107411.3107429","DOIUrl":"https://doi.org/10.1145/3107411.3107429","url":null,"abstract":"Metagenomic longitudinal studies have become a widely-used study design to investigate the dynamics of the microbial ecological systems and their temporal effects. One of the important questions to be addressed in longitudinal studies is the identification of time intervals when microbial features show changes in their abundance. We propose a statistical method that is based on a semi-parametric Smoothing Spline ANOVA and negative binomial distribution to model the time-course of the features between two phenotypes. We demonstrate the superior performance of our proposed method compared to the two currently existing methods using simulated data. We present the analysis results of our proposed method in an analysis of a longitudinal dataset that investigates the association between the development of type 1 diabetes in infants and the gut microbiome. The identified significant species and their specific time intervals reveal new information that can be used in improving intervention or treatment plans.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130942891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Analysis of Controls in ChIP-seq
Aseel Awdeh, T. Perkins
The chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) method, initially introduced a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome in various cell lines. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to detect background signal, whilst the ChIP-seq experiment captures the true binding or histone modification signal. However, a recurrent issue is the existence of noise and bias in the controls themselves, as well as different types of bias in ChIP-seq experiments. Thus, depending on which controls are used, peak calling can produce different results (i.e., binding site positions) for the same ChIP-seq experiment. Consequently, generating "smart" controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and thus increase the reliability and reproducibility of the results. Our analysis aims to improve our understanding of ChIP-seq controls and their biases. We use unsupervised clustering and dimensionality reduction techniques to compare 160 controls for the K562 cell line in the ENCODE project, finding distinct groupings of controls that correlate with experimental characteristics. To customize a control for each ChIP-seq experiment, we use LASSO regression to fit a sparse set of controls to each of 500 ChIP-seq experiments (again, from ENCODE data for the K562 cell line). We look at how many controls are selected, which controls are used per ChIP-seq experiment, and how they are related to the different ChIP-seq experiment characteristics. Perhaps most surprisingly, we find that the LASSO models are not particularly sparse, often including half of the possible controls to model any given ChIP-seq experiment. 
Cross-validation as well as testing with smaller sets of candidate controls proves that such large numbers of controls are beneficial for modeling ChIP-seq background distributions. We also observe clusters of ChIP-seq experiments that tend to rely on clusters of controls, and we look at the experimental characteristics that tend to cause a given control to be useful in modeling the background of a given ChIP-seq experiment. Through these analyses, we attempt to answer largely-unstudied questions regarding how much control data and of what types are useful in ChIP-seq analysis, and how suitable controls can be matched to ChIP-seq datasets.
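Fitting a sparse set of control tracks to a ChIP-seq signal, as the abstract describes, can be sketched with a plain coordinate-descent LASSO solver. This is a generic LASSO implementation over an assumed bins-by-controls count matrix, not the authors' pipeline.

```python
import numpy as np

def lasso_cd(X, y, lam=1.0, n_iter=200):
    """Coordinate-descent LASSO: min_w 0.5*||y - Xw||^2 + lam*||w||_1.
    Rows of X = genomic bins, columns = candidate control tracks,
    y = ChIP-seq read counts per bin. Nonzero entries of w identify
    the controls selected to model the experiment's background."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # assumed nonzero columns
    r = y - X @ w
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * w[j]                    # partial residual without feature j
            rho = X[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft threshold
            r -= X[:, j] * w[j]
    return w
```

Sparsity of the fitted `w` is what the abstract measures: here, raising `lam` drives more control coefficients exactly to zero, and the surprising finding was that good fits still retained about half the candidate controls.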
{"title":"Analysis of Controls in ChIP-seq","authors":"Aseel Awdeh, T. Perkins","doi":"10.1145/3107411.3108230","DOIUrl":"https://doi.org/10.1145/3107411.3108230","url":null,"abstract":"The chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq) method, initially introduced a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome in various cell lines. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to detect background signal, whilst the ChIP-seq experiment captures the true binding or histone modification signal. However, a recurrent issue is the existence of noise and bias in the controls themselves, as well as different types of bias in ChIP-seq experiments. Thus, depending on which controls are used, peak calling can produce different results (i.e., binding site positions) for the same ChIP-seq experiment. Consequently, generating \"smart\" controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and thus increase the reliability and reproducibility of the results. Our analysis aims to improve our understanding of ChIP-seq controls and their biases. We use unsupervised clustering and dimensionality reduction techniques to compare 160 controls for the K562 cell line in the ENCODE project, finding distincting groupings of controls which correlate to experimental characteristics. To customize a control for each ChIP-seq experiment, we use LASSO regression to fit a sparse set of controls to each of 500 ChIP-seq experiments (again, from ENCODE data for the K562 cell line). We look at how many controls are selected, which controls are used per ChIP-seq experiment, and how they are related to the different ChIP-seq experiment characteristics. 
Perhaps most surprisingly, we find that the LASSO models are not particularly sparse, often including half of the possible controls to model any given ChIP-seq. Cross-validation as well as testing with smaller sets of candidate controls proves that such large numbers of controls are beneficial for modeling ChIP-seq background distributions. We also observe clusters of ChIP-seq experiments that tend to rely on clusters of controls, and we look at the experimental characteristics that tend to cause a given control to be useful in modeling the background of a given ChIP-seq experiment. Through these analyses, we attempt to answer largely-unstudied questions regarding how much control data and of what types are useful in ChIP-seq analysis, and how suitable controls can be matched to ChIP-seq datasets.","PeriodicalId":246388,"journal":{"name":"Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129288684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
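The unsupervised clustering and dimensionality reduction step mentioned above could look like the following sketch: PCA via SVD plus a naive k-means. The abstract does not name the specific methods used, so these are stand-ins chosen for illustration.

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X (controls x genomic-bin read counts)
    onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(X, k, n_iter=50):
    """Naive k-means with a deterministic spread-out initialization."""
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every center.
        d = ((X[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels
```

Applied to the 160 K562 controls, the cluster labels would then be compared against experimental metadata (lab, sequencing depth, fragmentation protocol, and so on) to see which characteristics the groupings track.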
Citations: 0