Fast Multi-Task SCCA Learning with Feature Selection for Multi-Modal Brain Imaging Genetics
Pub Date: 2018-12-01 | Epub Date: 2019-01-24 | DOI: 10.1109/BIBM.2018.8621298
Lei Du, Kefei Liu, Xiaohui Yao, Shannon L Risacher, Junwei Han, Lei Guo, Andrew J Saykin, Li Shen
Brain imaging genetics studies the genetic basis of brain structure and function by integrating genotypic data, such as single nucleotide polymorphisms (SNPs), with imaging quantitative traits (QTs). In this area, both multi-task learning (MTL) and sparse canonical correlation analysis (SCCA) methods are widely used, since they are superior to independent, pairwise univariate analyses. However, MTL methods generally incorporate only a few QTs and are not designed for feature selection from a large number of QTs, while existing SCCA methods typically employ only one modality of QTs to study its association with SNPs. Both MTL and SCCA encounter computational challenges as the number of SNPs increases. In this paper, combining the merits of MTL and SCCA, we propose a novel multi-task SCCA (MTSCCA) learning framework to identify bi-multivariate associations between SNPs and multi-modal imaging QTs. MTSCCA can make use of the complementary information carried by different imaging modalities. Using the G2,1-norm regularization, MTSCCA treats all SNPs in the same group together to enforce sparsity at the group level. The l2,1-norm penalty is used to jointly select features across multiple tasks for SNPs, and across multiple modalities for QTs. A fast optimization algorithm is proposed that uses the grouping information of SNPs. Compared with conventional SCCA methods, MTSCCA obtains improved performance in terms of both correlation coefficients and canonical weight patterns. In addition, our method runs very fast and is easy to implement, and thus could provide a powerful tool for genome-wide, brain-wide imaging genetics studies.
{"title":"Fast Multi-Task SCCA Learning with Feature Selection for Multi-Modal Brain Imaging Genetics.","authors":"Lei Du, Kefei Liu, Xiaohui Yao, Shannon L Risacher, Junwei Han, Lei Guo, Andrew J Saykin, Li Shen","doi":"10.1109/BIBM.2018.8621298","DOIUrl":"https://doi.org/10.1109/BIBM.2018.8621298","url":null,"abstract":"<p><p>Brain imaging genetics studies the genetic basis of brain structures and functions via integrating both genotypic data such as single nucleotide polymorphism (SNP) and imaging quantitative traits (QTs). In this area, both multi-task learning (MTL) and sparse canonical correlation analysis (SCCA) methods are widely used since they are superior to those independent and pairwise univariate analyses. MTL methods generally incorporate a few of QTs and are not designed for feature selection from a large number of QTs; while existing SCCA methods typically employ only one modality of QTs to study its association with SNPs. Both MTL and SCCA encounter computational challenges as the number of SNPs increases. In this paper, combining the merits of MTL and SCCA, we propose a novel multi-task SCCA (MTSCCA) learning framework to identify bi-multivariate associations between SNPs and multi-modal imaging QTs. MTSCCA could make use of the complementary information carried by different imaging modalities. Using the <i>G</i> <sub>2,1</sub>-norm regularization, MTSCCA treats all SNPs in the same group together to enforce sparsity at the group level. The <math> <mrow><msub><mi>l</mi> <mrow><mn>2</mn> <mo>,</mo> <mn>1</mn></mrow> </msub> </mrow> </math> -norm penalty is used to jointly select features across multiple tasks for SNPs, and across multiple modalities for QTs. A fast optimization algorithm is proposed using the grouping information of SNPs. Compared with conventional SCCA methods, MTSCCA obtains improved performance regarding both correlation coefficients and canonical weights patterns. In addition, our method runs very fast and is easy-to-implement, and thus could provide a powerful tool for genome-wide brain-wide imaging genetic studies.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2018 ","pages":"356-361"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/BIBM.2018.8621298","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37065392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
INDEED: R package for network based differential expression analysis
Pub Date: 2018-12-01 | Epub Date: 2019-01-24 | DOI: 10.1109/BIBM.2018.8621426
Zhenzhi Li, Yiming Zuo, Chaohui Xu, Rency S Varghese, Habtom W Ressom
With the recent advancement of omics technologies, fueled by decreasing costs and a growing number of available datasets, computational methods for differential expression analysis are sought to identify disease-associated biomolecules. Conventional differential expression analysis methods (e.g., Student's t-test, ANOVA) focus on assessing the mean and variance of biomolecules in each biological group. Network-based approaches, on the other hand, take into account the interactions between biomolecules when choosing differentially expressed ones. These interactions are typically evaluated by correlation methods, which tend to generate over-complicated networks due to many seemingly indirect associations. In this paper, we introduce a new R/Bioconductor package, INDEED, that allows users to construct a sparse network based on partial correlation and to identify biomolecules that show significant changes at both the individual-expression and pairwise-interaction levels. We applied INDEED to the analysis of two omics datasets acquired in a cancer biomarker discovery study to help rank disease-associated biomolecules. We believe biomolecules selected by INDEED lead to improved sensitivity and specificity in detecting disease status compared with those selected by conventional statistical methods. INDEED's framework is also amenable to further expansion to integrate networks from multi-omics studies, thereby allowing selection of reliable disease-associated biomolecules or disease biomarkers.
{"title":"INDEED: R package for network based differential expression analysis.","authors":"Zhenzhi Li, Yiming Zuo, Chaohui Xu, Rency S Varghese, Habtom W Ressom","doi":"10.1109/BIBM.2018.8621426","DOIUrl":"https://doi.org/10.1109/BIBM.2018.8621426","url":null,"abstract":"<p><p>With recent advancement of omics technologies, fueled by decreased cost and increased number of available datasets, computational methods for differential expression analysis are sought to identify disease-associated biomolecules. Conventional differential expression analysis methods (e.g. student's t-test, ANOVA) focus on assessing mean and variance of biomolecules in each biological group. On the other hand, network-based approaches take into account the interactions between biomolecules in choosing differentially expressed ones. These interactions are typically evaluated by correlation methods that tend to generate over-complicated networks due to many seemingly indirect associations. In this paper, we introduce a new R/Bioconductor package INDEED that allows users to construct a sparse network based on partial correlation, and to identify biomolecules that have significant changes both at individual expression and pairwise interaction levels. We applied INDEED for analysis of two omic datasets acquired in a cancer biomarker discovery study to help rank disease-associated biomolecules. We believe biomolecules selected by INDEED lead to improved sensitivity and specificity in detecting disease status compared to those selected by conventional statistical methods. Also, INDEED's framework is amenable to further expansion to integrate networks from multi-omic studies, thereby allowing selection of reliable disease-associated biomolecules or disease biomarkers.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2018 ","pages":"2709-2712"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/BIBM.2018.8621426","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37313557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Generative Classifiers for Thoracic Disease Diagnosis with Chest X-ray Images
Pub Date: 2018-12-01 | Epub Date: 2019-01-24 | DOI: 10.1109/BIBM.2018.8621107
Chengsheng Mao, Yiheng Pan, Zexian Zeng, Liang Yao, Yuan Luo
Thoracic diseases are serious health problems that affect a large number of people. Chest X-ray is currently one of the most popular methods for diagnosing thoracic diseases, playing an important role in the healthcare workflow. However, reading chest X-ray images and giving an accurate diagnosis remain challenging tasks even for expert radiologists. With the success of deep learning in computer vision, a growing number of deep neural network architectures have been applied to chest X-ray image classification. However, most previous deep neural network classifiers were based on deterministic architectures, which are usually noise-sensitive and likely to aggravate overfitting. In this paper, to make a deep architecture more robust to noise and to reduce overfitting, we propose using deep generative classifiers to automatically diagnose thoracic diseases from chest X-ray images. Unlike a traditional deterministic classifier, a deep generative classifier has a distribution middle layer in the deep neural network. A sampling layer then draws a random sample from the distribution layer and inputs it to the following layer for classification. The classifier is generative because the class label is generated from samples of a related distribution. By training the model with a certain amount of randomness, deep generative classifiers are expected to be robust to noise, reduce overfitting, and thus achieve good performance. We implemented our deep generative classifiers based on a number of well-known deterministic neural network architectures and tested our models on the ChestX-ray14 dataset. The results demonstrate the superiority of deep generative classifiers over the corresponding deep deterministic classifiers.
{"title":"Deep Generative Classifiers for Thoracic Disease Diagnosis with Chest X-ray Images.","authors":"Chengsheng Mao, Yiheng Pan, Zexian Zeng, Liang Yao, Yuan Luo","doi":"10.1109/BIBM.2018.8621107","DOIUrl":"https://doi.org/10.1109/BIBM.2018.8621107","url":null,"abstract":"<p><p>Thoracic diseases are very serious health problems that plague a large number of people. Chest X-ray is currently one of the most popular methods to diagnose thoracic diseases, playing an important role in the healthcare workflow. However, reading the chest X-ray images and giving an accurate diagnosis remain challenging tasks for expert radiologists. With the success of deep learning in computer vision, a growing number of deep neural network architectures were applied to chest X-ray image classification. However, most of the previous deep neural network classifiers were based on deterministic architectures which are usually very noise-sensitive and are likely to aggravate the overfitting issue. In this paper, to make a deep architecture more robust to noise and to reduce overfitting, we propose using deep generative classifiers to automatically diagnose thorax diseases from the chest X-ray images. Unlike the traditional deterministic classifier, a deep generative classifier has a distribution middle layer in the deep neural network. A sampling layer then draws a random sample from the distribution layer and input it to the following layer for classification. The classifier is generative because the class label is generated from samples of a related distribution. Through training the model with a certain amount of randomness, the deep generative classifiers are expected to be robust to noise and can reduce overfitting and then achieve good performances. We implemented our deep generative classifiers based on a number of well-known deterministic neural network architectures, and tested our models on the chest X-ray14 dataset. The results demonstrated the superiority of deep generative classifiers compared with the corresponding deep deterministic classifiers.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2018 ","pages":"1209-1214"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/BIBM.2018.8621107","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41223004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Early Prediction of Acute Kidney Injury in Critical Care Setting Using Clinical Notes
Pub Date: 2018-12-01 | DOI: 10.1109/bibm.2018.8621574
Yikuan Li, Liang Yao, Chengsheng Mao, Anand Srivastava, Xiaoqian Jiang, Yuan Luo
Acute kidney injury (AKI) in critically ill patients is associated with significant morbidity and mortality. Developing novel methods to identify patients with AKI earlier will allow testing of novel strategies to prevent or reduce the complications of AKI. We developed data-driven prediction models to estimate the risk of new AKI onset. We generated models from clinical notes recorded within the first 24 hours following intensive care unit (ICU) admission, extracted from the Medical Information Mart for Intensive Care III (MIMIC-III). From the clinical notes, we generated clinically meaningful word and concept representations and embeddings, respectively. Five supervised learning classifiers and a knowledge-guided deep learning architecture were used to construct the prediction models. The best configuration yielded a competitive AUC of 0.779. Our work suggests that natural language processing of clinical notes can be applied to assist clinicians in identifying the risk of incident AKI onset in critically ill patients upon admission to the ICU.
{"title":"Early Prediction of Acute Kidney Injury in Critical Care Setting Using Clinical Notes.","authors":"Yikuan Li, Liang Yao, Chengsheng Mao, Anand Srivastava, Xiaoqian Jiang, Yuan Luo","doi":"10.1109/bibm.2018.8621574","DOIUrl":"10.1109/bibm.2018.8621574","url":null,"abstract":"<p><p>Acute kidney injury (AKI) in critically ill patients is associated with significant morbidity and mortality. Development of novel methods to identify patients with AKI earlier will allow for testing of novel strategies to prevent or reduce the complications of AKI. We developed data-driven prediction models to estimate the risk of new AKI onset. We generated models from clinical notes within the first 24 hours following intensive care unit (ICU) admission extracted from Medical Information Mart for Intensive Care III (MIMIC-III). From the clinical notes, we generated clinically meaningful word and concept representations and embeddings, respectively. Five supervised learning classifiers and knowledge-guided deep learning architecture were used to construct prediction models. The best configuration yielded a competitive AUC of 0.779. Our work suggests that natural language processing of clinical notes can be applied to assist clinicians in identifying the risk of incident AKI onset in critically ill patients upon admission to the ICU.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2018 ","pages":"683-686"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7768909/pdf/nihms-1656128.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38762863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collaborative Phenotype Inference from Comorbid Substance Use Disorders and Genotypes
Pub Date: 2017-11-01 | Epub Date: 2017-12-18 | DOI: 10.1109/BIBM.2017.8217681
Jin Lu, Jiangwen Sun, Xinyu Wang, Henry R Kranzler, Joel Gelernter, Jinbo Bi
Data in large-scale genetic studies of complex human diseases, such as substance use disorders, are often incomplete. Despite great progress in genotype imputation (e.g., the IMPUTE2 method), considerably less progress has been made in inferring phenotypes. We designed a novel approach that integrates individuals' comorbid conditions with their genotype data to infer missing (unreported) diagnostic criteria of a disorder. The premise of our approach derives from correlations among symptoms and the shared biological bases of concurrent disorders, such as co-dependence on cocaine and opioids. We describe a matrix completion method that constructs a bi-linear model based on the interactions of genotypes and known symptoms of related disorders to infer unknown values of another set of symptoms or phenotypes. An efficient stochastic and parallel algorithm based on the linearized alternating direction method of multipliers was developed to solve the proposed optimization problem. Empirical evaluation in a case study, in comparison with other advanced matrix completion methods, shows that our approach both significantly improves imputation accuracy and provides greater computational efficiency.
{"title":"Collaborative Phenotype Inference from Comorbid Substance Use Disorders and Genotypes.","authors":"Jin Lu, Jiangwen Sun, Xinyu Wang, Henry R Kranzler, Joel Gelernter, Jinbo Bi","doi":"10.1109/BIBM.2017.8217681","DOIUrl":"10.1109/BIBM.2017.8217681","url":null,"abstract":"<p><p>Data in large-scale genetic studies of complex human diseases, such as substance use disorders, are often incomplete. Despite great progress in genotype imputation, e.g., the IMPUTE2 method, considerably less progress has been made in inferring phenotypes. We designed a novel approach to integrate individuals' comorbid conditions with their genotype data to infer missing (unreported) diagnostic criteria of a disorder. The premise of our approach derives from correlations among symptoms and the shared biological bases of concurrent disorders such as co-dependence on cocaine and opioids. We describe a matrix completion method to construct a bi-linear model based on the interactions of genotypes and known symptoms of related disorders to infer unknown values of another set of symptoms or phenotypes. An efficient stochastic and parallel algorithm based on the linearized alternating direction method of multipliers was developed to solve the proposed optimization problem. Empirical evaluation of the approach in comparison with other advanced data matrix completion methods via a case study shows that it both significantly improves imputation accuracy and provides greater computational efficiency.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2017 ","pages":"392-397"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5947969/pdf/nihms913259.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36094670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variable Selection in Heterogeneous Datasets: A Truncated-rank Sparse Linear Mixed Model with Applications to Genome-wide Association Studies
Pub Date: 2017-11-01 | Epub Date: 2017-12-18 | DOI: 10.1109/BIBM.2017.8217687
Haohan Wang, Bryon Aragam, Eric P Xing
A fundamental challenge in modern datasets of ever-increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations when the underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of individual relationships in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our model.
{"title":"Variable Selection in Heterogeneous Datasets: A Truncated-rank Sparse Linear Mixed Model with Applications to Genome-wide Association Studies.","authors":"Haohan Wang, Bryon Aragam, Eric P Xing","doi":"10.1109/BIBM.2017.8217687","DOIUrl":"10.1109/BIBM.2017.8217687","url":null,"abstract":"<p><p>A fundamental and important challenge in modern datasets of ever increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naïvely applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of individual relationships in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge we discover with our model.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2017 ","pages":"431-438"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5889139/pdf/nihms874620.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35986011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Integration through Ontology-Based Data Access to Support Integrative Data Analysis: A Case Study of Cancer Survival
Pub Date: 2017-11-01 | Epub Date: 2017-12-18 | DOI: 10.1109/BIBM.2017.8217849
Hansi Zhang, Yi Guo, Qian Li, Thomas J George, Elizabeth A Shenkman, Jiang Bian
To improve cancer survival rates and prognosis, one of the first steps is to improve our understanding of the contributory factors associated with cancer survival. Prior research has suggested that cancer survival is influenced by multiple factors at multiple levels. Most existing analyses of cancer survival have used data from a single source, yet there are key challenges in integrating variables from different sources. Data integration is a daunting task because data from different sources can be heterogeneous in syntax, schema, and particularly semantics. Thus, we propose to adopt a semantic data integration approach that generates a universal conceptual representation of "information", including data and their relationships. This paper describes a case study of semantic data integration linking three data sets that cover both individual- and contextual-level factors, for the purpose of assessing the association of the predictors of interest with cancer survival using Cox proportional hazards models.
{"title":"Data Integration through Ontology-Based Data Access to Support Integrative Data Analysis: A Case Study of Cancer Survival.","authors":"Hansi Zhang, Yi Guo, Qian Li, Thomas J George, Elizabeth A Shenkman, Jiang Bian","doi":"10.1109/BIBM.2017.8217849","DOIUrl":"https://doi.org/10.1109/BIBM.2017.8217849","url":null,"abstract":"<p><p>To improve cancer survival rates and prognosis, one of the first steps is to improve our understanding of contributory factors associated with cancer survival. Prior research has suggested that cancer survival is influenced by multiple factors from multiple levels. Most of existing analyses of cancer survival used data from a single source. Nevertheless, there are key challenges in integrating variables from different sources. Data integration is a daunting task because data from different sources can be heterogeneous in syntax, schema, and particularly semantics. Thus, we propose to adopt a semantic data integration approach that generates a universal conceptual representation of \"information\" including data and their relationships. This paper describes a case study of semantic data integration linking three data sets that cover both individual and contextual level factors for the purpose of assessing the association of the predictors of interest with cancer survival using cox proportional hazard models.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2017 ","pages":"1300-1303"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/BIBM.2017.8217849","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36054115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Auditing the Assignments of Top-Level Semantic Types in the UMLS Semantic Network to UMLS Concepts
Pub Date: 2017-11-01 | Epub Date: 2017-12-18 | DOI: 10.1109/BIBM.2017.8217840
Zhe He, Yehoshua Perl, Gai Elhanan, Yan Chen, James Geller, Jiang Bian
The Unified Medical Language System (UMLS) is an important terminological system. By the policy of its curators, each UMLS concept should be assigned the most specific Semantic Types (STs) in the UMLS Semantic Network (SN). Hence, the Semantic Types of most UMLS concepts are assigned at or near the bottom (leaves) of the UMLS Semantic Network. While most ST assignments are correct, some errors do occur. Quality assurance efforts of UMLS curators should therefore concentrate on automatically detected sets of UMLS concepts with higher error rates than random sets. In this paper, we investigate the assignments of top-level semantic types in the UMLS Semantic Network to concepts, identify potentially erroneous assignments, and define four categories of errors, thereby helping curators of the UMLS avoid these assignment errors. Human experts analyzed samples of concepts assigned 10 of the top-level semantic types and categorized the erroneous ST assignments into these four logical categories. Two-thirds of the concepts assigned these 10 top-level semantic types are erroneous. Our results demonstrate that reviewing top-level semantic type assignments to concepts provides an effective way to perform UMLS quality assurance, compared with reviewing a random selection of semantic type assignments.
{"title":"Auditing the Assignments of Top-Level Semantic Types in the UMLS Semantic Network to UMLS Concepts.","authors":"Zhe He, Yehoshua Perl, Gai Elhanan, Yan Chen, James Geller, Jiang Bian","doi":"10.1109/BIBM.2017.8217840","DOIUrl":"https://doi.org/10.1109/BIBM.2017.8217840","url":null,"abstract":"<p><p>The Unified Medical Language System (UMLS) is an important terminological system. By the policy of its curators, each concept of the UMLS should be assigned the most specific Semantic Types (STs) in the UMLS Semantic Network (SN). Hence, the Semantic Types of most UMLS concepts are assigned at or near the bottom (leaves) of the UMLS Semantic Network. While most ST assignments are correct, some errors do occur. Therefore, Quality Assurance efforts of UMLS curators for ST assignments should concentrate on automatically detected sets of UMLS concepts with higher error rates than random sets. In this paper, we investigate the assignments of top-level semantic types in the UMLS semantic network to concepts, identify potential erroneous assignments, define four categories of errors, and thus provide assistance to curators of the UMLS to avoid these assignments errors. Human experts analyzed samples of concepts assigned 10 of the top-level semantic types and categorized the erroneous ST assignments into these four logical categories. Two thirds of the concepts assigned these 10 top-level semantic types are erroneous. Our results demonstrate that reviewing top-level semantic type assignments to concepts provides an effective way for UMLS quality assurance, comparing to reviewing a random selection of semantic type assignments.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2017 ","pages":"1262-1269"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/BIBM.2017.8217840","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35772366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Gramulator: Improving Precision in the Classification of Personal Health-Experience Tweets with Deep Learning
Pub Date: 2017-11-01 | Epub Date: 2017-12-18 | DOI: 10.1109/BIBM.2017.8217820
Ricardo A Calix, Ravish Gupta, Matrika Gupta, Keyuan Jiang
Health surveillance is an important task for tracking events related to human health, and one of its areas is pharmacovigilance. Pharmacovigilance tracks and monitors the safe use of pharmaceutical products, including side effects that may be caused by medicines and other health-related drugs. Medical professionals have a difficult time collecting this information, and it is anticipated that social media could help collect such data and track side effects. Twitter data can be used for this task, given that users post their personal health-related experiences online. One problem with Twitter data, however, is that it contains a lot of noise, so an approach is needed to remove it. In this paper, several machine learning algorithms, including deep neural networks, are used to build classifiers that can help detect these Personal Experience Tweets (PETs). Finally, we propose a method called the Deep Gramulator that improves results. Results of the analysis are presented and discussed.
{"title":"Deep Gramulator: Improving Precision in the Classification of Personal Health-Experience Tweets with Deep Learning.","authors":"Ricardo A Calix, Ravish Gupta, Matrika Gupta, Keyuan Jiang","doi":"10.1109/BIBM.2017.8217820","DOIUrl":"https://doi.org/10.1109/BIBM.2017.8217820","url":null,"abstract":"<p><p>Health surveillance is an important task to track the happenings related to human health, and one of its areas is pharmacovigilance. Pharmacovigilance tracks and monitors safe use of pharmaceutical products. Pharmacovigilance involves tracking side effects that may be caused by medicines and other health related drugs. Medical professionals have a difficult time collecting this information. It is anticipated that social media could help to collect this data and track side effects. Twitter data can be used for this task given that users post their personal health related experiences on-line. One problem with Twitter data, however, is that it contains a lot of noise. Therefore, an approach is needed to remove the noise. In this paper, several machine learning algorithms including deep neural nets are used to build classifiers that can help to detect these Personal Experience Tweets (PETs). Finally, we propose a method called the Deep Gramulator that improves results. Results of the analysis are presented and discussed.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2017 ","pages":"1154-1159"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/BIBM.2017.8217820","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36286319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Learning-based MSMS Spectra Reduction in Support of Running Multiple Protein Search Engines on Cloud
Pub Date: 2017-11-01 | DOI: 10.1109/bibm.2017.8217951
Majdi Maabreh, Basheer Qolomany, Izzat Alsmadi, Ajay Gupta
The diversity of available protein search engines with respect to their matching algorithms, the low overlap among their results, and the disparity of their coverage encourage the proteomics community to use ensemble solutions built from different search engines. Advances in cloud computing technology and the availability of distributed processing clusters can also support this task. However, transferring data and combining results could become the major bottleneck in this setting. The flood of billions of observed mass spectra, amounting to hundreds of gigabytes or potentially terabytes of data, could easily cause congestion, increase the risk of failure and poor performance, add computation cost, and waste available resources. Therefore, in this study, we propose a deep learning model to mitigate the traffic over the cloud network and thus reduce the cost of cloud computing. The model, which depends on the top 50 intensities and their m/z values in each spectrum, removes any spectrum predicted not to pass the majority voting of the participating search engines. Our results using three search engines (pFind, Comet, and X!Tandem) and four different datasets are promising and encourage investment in deep learning to solve this type of big data problem.
{"title":"Deep Learning-based MSMS Spectra Reduction in Support of Running Multiple Protein Search Engines on Cloud.","authors":"Majdi Maabreh, Basheer Qolomany, Izzat Alsmadi, Ajay Gupta","doi":"10.1109/bibm.2017.8217951","DOIUrl":"10.1109/bibm.2017.8217951","url":null,"abstract":"<p><p>The diversity of the available protein search engines with respect to the utilized matching algorithms, the low overlap ratios among their results and the disparity of their coverage encourage the community of proteomics to utilize ensemble solutions of different search engines. The advancing in cloud computing technology and the availability of distributed processing clusters can also provide support to this task. However, data transferring and results' combining, in this case, could be the major bottleneck. The flood of billions of observed mass spectra, hundreds of Gigabytes or potentially Terabytes of data, could easily cause the congestions, increase the risk of failure, poor performance, add more computations' cost, and waste available resources. Therefore, in this study, we propose a deep learning model in order to mitigate the traffic over cloud network and, thus reduce the cost of cloud computing. The model, which depends on the top 50 intensities and their m/z values of each spectrum, removes any spectrum which is predicted not to pass the majority voting of the participated search engines. Our results using three search engines namely: pFind, Comet and X!Tandem, and four different datasets are promising and promote the investment in deep learning to solve such type of Big data problems.</p>","PeriodicalId":74563,"journal":{"name":"Proceedings. IEEE International Conference on Bioinformatics and Biomedicine","volume":"2017 ","pages":"1909-1914"},"PeriodicalIF":0.0,"publicationDate":"2017-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8382039/pdf/nihms-1728667.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39355075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}