Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics: Latest Articles (Page 8)

Modularity Analysis of Bipartite Networks and Multivariate ANOVA for Identification of Differentially Expressed Proteins in a Mouse Model of Down Syndrome

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412421

A. Jazayeri, Sara Pajouhanfar, Sadaf Saba, Christopher C. Yang

Down Syndrome (DS) is one of the most common disorders caused by the presence of an extra copy of chromosome 21. It has been shown that the expression of various genes located on chromosomes other than chromosome 21 is also affected in DS. Given the practical and ethical difficulties of human tissue studies, the Ts65Dn mouse model has been widely used in DS research. In this study, we propose a pipeline composed of a supervised learning approach, modularity analysis of a bipartite network, and multivariate analysis of variance (MANOVA) for the identification of differentially expressed proteins (DEP) among different classes of mouse models. The proposed pipeline is tested using the expression levels of 77 proteins in eight classes of mice. The data include protein expression measurements for 34 trisomic Ts65Dn mice and 38 control mice. Each group is divided into four classes according to whether the mice were stimulated to learn and whether they were injected with memantine or saline. Previously proposed approaches have been unable to identify DEP among all eight classes simultaneously. Here, we show that our proposed pipeline can successfully identify the set of proteins expressed differently among all eight classes. The findings of this study can inform the study of learning responses to different treatments and of protein-treatment associations in DS. The proposed pipeline can also be adopted to identify DEP in DS or in other diseases and health conditions, which can in turn inform the development of improved personalized treatment and management strategies.
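The modularity-analysis step can be illustrated with a minimal sketch. It assumes Barber's bipartite modularity as the quality function; the abstract does not say which definition the authors use. The toy edge list linking mouse classes to proteins, the node names, and the module assignments are all hypothetical.

```python
from collections import defaultdict

def bipartite_modularity(edges, module_of):
    """Barber's bipartite modularity Q for an unweighted bipartite graph.

    edges: list of unique (class_node, protein_node) pairs.
    module_of: dict mapping every node to a module label.
    Q = (1/m) * sum over (class i, protein j) of
        (A_ij - k_i * d_j / m) * [module_of[i] == module_of[j]]
    """
    m = len(edges)
    deg_left = defaultdict(int)   # degree of each class node
    deg_right = defaultdict(int)  # degree of each protein node
    for u, v in edges:
        deg_left[u] += 1
        deg_right[v] += 1
    edge_set = set(edges)
    q = 0.0
    for u, ku in deg_left.items():
        for v, dv in deg_right.items():
            if module_of[u] == module_of[v]:
                a_uv = 1.0 if (u, v) in edge_set else 0.0
                q += a_uv - ku * dv / m
    return q / m

# Hypothetical toy network: two mouse classes, four proteins.
edges = [("c1", "pA"), ("c1", "pB"), ("c2", "pC"), ("c2", "pD")]
two_modules = {"c1": 0, "pA": 0, "pB": 0, "c2": 1, "pC": 1, "pD": 1}
one_module = {node: 0 for node in two_modules}

print(bipartite_modularity(edges, two_modules))  # 0.5
print(bipartite_modularity(edges, one_module))   # 0.0
```

Splitting the toy network into its two natural modules scores higher than lumping everything together, which is the signal a modularity-maximizing search would follow.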

Citations: 0

RA2Vec

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414925

Rajitha Yasas Wijesekara, Ashwin Lahorkar, Kunal Rathore, J. Valadi

Protein function identification has become an important task because of the plethora of new genomes being sequenced. Recently, distributed representation [1] of words in the form of continuous vectors has been found to be a very efficient way to represent semantic and syntactic information. In this representation, each word is embedded in an n-dimensional space, with similar words having nearby vectors in the embedding space. In the popular skip-gram configuration, the model uses the current word to predict its surrounding words. In this work we introduce a distributed representation for protein sequences based on reduced amino acid alphabets. In our RA2Vec (Reduced Alphabets to Vectors) implementation, we first map all Swiss-Prot sequences to reduced forms based on hydropathy and on conformational similarity. Then, using a skip-gram based method, we create reduced-alphabet embedding vectors (RA2Vec) for each set. Embedding vectors for sequences with the original ProtVec representation [2] were also created. These vectors were created for various combinations of k-gram sizes and vector sizes. All seven combinations of the original ProtVec embedding vectors, the hydropathy-based embedding vectors, and the conformational-similarity-based embedding vectors were then used as input to Support Vector Machine classifiers, and classification models were built. The embedding vectors were further reduced using the Recursive Feature Elimination (RFE) method to maximize fivefold cross-validation accuracy. We assessed the validity and utility of the new representations on five different data sets. Our results on all data sets indicate that certain synergistic combinations of the new representations, with and without the ProtVec embedding, can result in significantly improved performance.
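The preprocessing described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the three-letter grouping `TOY_GROUPS` is a made-up reduction (the abstract does not give the actual hydropathy or conformational-similarity alphabets), the k-gram splitting follows the shifted, non-overlapping scheme commonly associated with ProtVec, and all function names are hypothetical.

```python
from itertools import combinations

# Toy three-letter reduction, for illustration only (not the paper's alphabet).
TOY_GROUPS = {"h": "AVLIMFWC", "p": "GSTPYNQH", "c": "DEKR"}
REDUCE = {aa: label for label, members in TOY_GROUPS.items() for aa in members}

def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in seq)

def kgram_sentences(seq, k=3):
    """Split a sequence into k 'sentences' of non-overlapping k-grams,
    one sentence per reading offset 0..k-1."""
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]

def embedding_combinations(names):
    """All non-empty subsets of the embedding families."""
    return [c for r in range(1, len(names) + 1) for c in combinations(names, r)]

reduced = reduce_sequence("MKVLAGD")       # 'hchhhpc'
sentences = kgram_sentences(reduced, k=3)  # [['hch', 'hhp'], ['chh', 'hpc'], ['hhh']]
combos = embedding_combinations(["ProtVec", "Hydropathy", "ConformationalSimilarity"])
print(len(combos))  # 7
```

The k-gram sentences are what a skip-gram trainer would consume as its corpus. Three embedding families give 2^3 - 1 = 7 non-empty combinations, which matches the "seven combinations" in the abstract.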

Citations: 1

Integrative Deep Learning for PanCancer Molecular Subtype Classification Using Histopathological Images and RNAseq Data

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412414

Fatima Zare, J. Noorbakhsh, Tianyu Wang, Jeffrey H. Chuang, S. Nabavi

Deep learning has recently become a key methodology for the study and interpretation of cancer histology images. The ability of convolutional neural networks (CNNs) to automatically learn features from raw data without the need for pathologist expert knowledge, as well as the availability of annotated histopathology datasets, has contributed to a growing interest in deep learning applications in histopathology. In clinical practice for cancer, histopathological images are commonly used for diagnosis, prognosis, and treatment. Recently, molecular subtype classification has gained significant attention for predicting the outcomes of standard chemotherapy and for creating personalized targeted cancer therapies. Genomic profiles, especially gene expression data, are mostly used for molecular subtyping. In this study, we developed a novel PanCancer CNN model based on Google Inception V3 transfer learning to classify molecular subtypes using histopathological images. We used 22,484 Hematoxylin and Eosin (H&E) slides from 32 cancer types provided by The Cancer Genome Atlas (TCGA) to train and evaluate the model. We showed that, by employing deep learning, H&E slides can be used to classify molecular subtypes of solid tumor samples with high areas under the curve (AUCs; micro-average = 0.90, macro-average = 0.90). In cancer studies, combining histopathological images with genomic data has rarely been explored. We investigated the relationship between features extracted from H&E images and features extracted from gene expression profiles. We observed that the features from these two modalities (H&E images and gene expression values) for molecular subtyping are highly correlated. We therefore developed an integrative deep learning model that combines histological images and gene expression profiles. We showed that the integrative model improves the overall performance of molecular subtype classification (AUCs: micro-average = 0.99, macro-average = 0.97). These results show that integrating H&E images and gene expression profiles can enhance the accuracy of molecular subtype classification.
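Because the abstract reports both micro- and macro-averaged AUCs, a small sketch of how the two differ may help. It assumes the usual one-vs-rest convention: the macro average is the mean of per-class AUCs, while the micro average pools every (sample, class) decision before computing a single AUC. The labels and scores are toy values, not data from the paper.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def micro_macro_auc(y_true, y_score, n_classes):
    """y_true: list of class indices; y_score: one list of per-class scores per sample."""
    per_class = []
    pooled_labels, pooled_scores = [], []
    for c in range(n_classes):
        labels = [1 if y == c else 0 for y in y_true]  # one-vs-rest labels
        scores = [row[c] for row in y_score]
        per_class.append(binary_auc(labels, scores))
        pooled_labels += labels
        pooled_scores += scores
    macro = sum(per_class) / n_classes
    micro = binary_auc(pooled_labels, pooled_scores)
    return micro, macro

# Toy predictions for four samples and three hypothetical subtypes.
y_true = [0, 1, 2, 2]
y_score = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.1, 0.4],
]
micro, macro = micro_macro_auc(y_true, y_score, n_classes=3)
print(micro, macro)  # 0.96875 1.0
```

In this toy case every class is perfectly separable on its own score column, so the macro average is 1.0, but pooling exposes one mis-ordered pair across classes and the micro average drops to 31/32.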

Citations: 1

Using player generated data to elucidate molecular docking

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414704

Torin Adamson, Selina M. Bauernfeind, Bruna Jacobson, Lydia Tapia

Molecular docking prediction, used to design new drugs and identify biological function, can be represented as a motion planning problem in high-dimensional space. This complex task can benefit from human guidance, i.e., interactive simulations combining visual and haptic feedback from atomic forces [4]. Human-generated data has been used to find solutions to complex problems in protein folding with FoldIt [3], showing that crowdsourced data collection is a viable way to help solve difficult problems in biology. However, existing interactive molecular docking systems typically rely on haptic devices that are not widely available to the public, limiting the potential to crowdsource solutions [1]. This user study tested 4 input devices to find potentially bound ligand-receptor states and biologically feasible paths via motion planning. On a PC laptop, players used one of the following: a 6 degree-of-freedom (DOF) haptic feedback device, a 3 DOF haptic feedback device, a game controller with vibration feedback, or a mouse and keyboard, together with our in-house interactive rigid body molecular docking program, DockAnywhere [1, 2]. Players tried to dock a known inhibitor of HIV Protease (PDB ID 1AJX). The goal is to move the ligand around the receptor to find the lowest potential energy state possible. Players get feedback as a positive integer score based on the potential energy and, for haptic devices, force feedback calculated from atomic forces. During gameplay, the ligand's positions and orientations are recorded as it is moved through the environment. In total, 439,832 states were collected from 32 players (8 per device). These states were divided into 4 sets, one per device, and motion planning queries found a feasible path to the goal state in all sets, with the mouse set showing more pronounced energy barriers. We found that exploration varied among devices, with players on haptic devices exploring the interaction energy landscape more uniformly. However, all 4 devices yielded low-energy ligand states with comparable energy values and similar closest distances to the binding site. In summary, force feedback showed no clear improvement in finding low potential energy states or in getting closer to the known binding state. However, haptic devices appear to enable a more thorough exploration of the state space, creating more samples and generating smoother paths. The nature of this guidance should be the subject of future investigation.
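The per-device comparison described above (lowest energy reached and closest approach to the binding site) can be sketched as a simple aggregation over recorded states. The record layout `(device, position, energy)`, the device labels, and the numbers are hypothetical. Orientations, which the study also records, are left out for brevity.

```python
import math
from collections import defaultdict

def summarize_states(states, binding_site):
    """Per device: state count, lowest energy, and closest distance to the binding site.

    states: iterable of (device, (x, y, z), energy) tuples (hypothetical layout).
    binding_site: (x, y, z) of the known bound ligand position.
    """
    summary = defaultdict(
        lambda: {"count": 0, "min_energy": math.inf, "min_distance": math.inf}
    )
    for device, position, energy in states:
        entry = summary[device]
        entry["count"] += 1
        entry["min_energy"] = min(entry["min_energy"], energy)
        entry["min_distance"] = min(
            entry["min_distance"], math.dist(position, binding_site)
        )
    return dict(summary)

# Toy recorded states, not data from the study.
states = [
    ("mouse", (3.0, 4.0, 0.0), -1.5),
    ("mouse", (0.0, 0.0, 1.0), -2.0),
    ("haptic_6dof", (0.0, 2.0, 0.0), -2.5),
]
result = summarize_states(states, binding_site=(0.0, 0.0, 0.0))
print(result["mouse"])  # {'count': 2, 'min_energy': -2.0, 'min_distance': 1.0}
```

Keeping one summary per device mirrors the study's division of the collected states into four sets, so the devices can be compared on the same two quantities.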

Citations: 0

Automated Classification of Acute Rejection from Endomyocardial Biopsies

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412430

F. Giuste, M. Venkatesan, Conan Y. Zhao, L. Tong, Yuanda Zhu, S. Deshpande, May D. Wang

Heart transplant rejection must be quickly and accurately identified to optimize anti-rejection therapies and prevent organ loss. Expert evaluation of endomyocardial biopsies is labor-intensive, prone to human bias, and suffers from low inter-rater agreement. Additionally, the increased use of digital pathology for biopsy examination has exacerbated the need for additional image quality control. To meet these challenges, we developed a novel transplant rejection detection pipeline that automatically identifies histology slides in need of rescanning and highlights biopsy regions showing potential signs of rejection. Our system leverages a fast and effective automated patch-level quality filter, as well as state-of-the-art feature extraction techniques, to provide quality whole-slide-level labeling of early rejection signs. We successfully identified digital pathology images with poor image quality and leveraged this quality gain to improve our novel weakly-supervised learning model, leading to a significant transplant rejection classification performance, with an AUC of 70.12% (±20.74%).

Citations: 5

Smart Computational Approaches with Advanced Feature Selection Algorithms for Optimizing the Classification of Mobility Data in Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412426

E. Rastegari, D. Orn, H. Ali

Recently, wearable mobility monitoring devices have gained a great deal of attention for collecting movement and gait-related data. Moreover, wearable movement monitoring devices combined with machine learning techniques have been shown to be successful in a variety of healthcare applications, including diagnosis, prognosis, and rehabilitation. However, advanced studies are needed to create accurate and robust models that can differentiate between populations based on their mobility signatures. This is particularly critical for monitoring the movement and gait patterns of individuals affected by neurodegenerative conditions such as Parkinson's Disease (PD). To achieve this goal, it is critical to employ a robust approach to model the available data and identify the optimal set of movement parameters for the classification process. In this work, we propose a computational approach to identify the best feature selection method for spatiotemporal gait parameters. We investigate several feature selection approaches and analyze their performance on the mobility classification problem, including maximum information gain with minimum correlation (MIGMC), maximum signal-to-noise ratio with minimum correlation (MSNR&MC), genetic algorithms (GA), decision trees (DT), and principal component analysis (PCA). These methods, along with newly proposed variations, are assessed in terms of classification accuracy, number of selected features, and computation time. Data collected from triaxial accelerometers attached to the ankles of individuals with PD, geriatrics (GE), and healthy elderly (HE) were used to train and test a set of six different machine learning techniques. Our results indicate that three out of six feature selection methods, namely GA, MSNR&MC, and a modified version of MIGMC, perform best in terms of classification accuracy. We also show that more robust performance is achieved when employing multiple algorithms, such as decision trees and genetic algorithms. This study provides a critical first step towards the much-needed goal of using data collected from wearable devices to extract important information for the diagnosis and rehabilitation of many movement-related medical conditions.
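The "maximum relevance with minimum correlation" idea behind MIGMC and MSNR&MC can be sketched as a greedy filter. This is an assumption about the general shape of such methods, not the authors' algorithm: features are visited in order of a precomputed relevance score (information gain or signal-to-noise ratio would be computed elsewhere) and kept only if they are not strongly correlated with a feature already kept. The feature names, relevance values, and the `max_corr` threshold are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(columns, relevance, max_corr=0.8):
    """Greedy filter: take features by decreasing relevance, skipping any whose
    absolute correlation with an already selected feature exceeds max_corr."""
    selected = []
    for name in sorted(relevance, key=relevance.get, reverse=True):
        if all(abs(pearson(columns[name], columns[kept])) <= max_corr
               for kept in selected):
            selected.append(name)
    return selected

# Toy gait features: step_time nearly duplicates stride_time.
columns = {
    "stride_time": [1.0, 2.0, 3.0, 4.0],
    "step_time": [2.0, 4.0, 6.0, 8.1],
    "cadence": [1.0, -1.0, 1.0, -1.0],
}
relevance = {"stride_time": 0.9, "step_time": 0.8, "cadence": 0.3}
chosen = select_features(columns, relevance)
print(chosen)  # ['stride_time', 'cadence']
```

The redundant `step_time` feature is dropped despite its high relevance, while the less relevant but weakly correlated `cadence` is kept, which is the trade-off the method names describe.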

Citations: 2

Interpretable Molecule Generation via Disentanglement Learning

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414709

Yuanqi Du, Xiaojie Guo, Amarda Shehu, Liang Zhao

Designing molecules with specific structural and functional properties (e.g., drug-likeness and water solubility) is central to advancing drug discovery and material science, but it poses outstanding challenges in both wet and dry laboratories. The search space is vast and rugged. Recent advances in deep generative models are motivating new computational approaches that build on deep learning to explore the molecular space. Despite rapid advancements, state-of-the-art deep generative models for molecule generation have many limitations, including a lack of interpretability. In this paper, we address this limitation by proposing a generic framework for interpretable molecule generation based on novel disentangled deep graph generative models with property control. Specifically, we propose a disentanglement enhancement strategy for graphs. We also propose a new deep neural architecture that achieves this learning objective efficiently for inference and generation of variable-size graphs. Extensive experimental evaluation demonstrates the superiority of our approach in various critical aspects, such as accuracy, novelty, and disentanglement.



Citations: 12

Joint Grid Discretization for Biological Pattern Discovery

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412415

Jiandong Wang, Sajal Kumar, Mingzhou Song

The complexity, dynamics, and scale of data acquired by modern biotechnology increasingly favor model-free computational methods that make minimal assumptions about underlying biological mechanisms. For example, single-cell transcriptome and proteome data have a throughput several orders of magnitude higher than that of bulk methods. However, many model-free statistical methods for pattern discovery, such as mutual information and chi-squared tests, require discrete data. Most discretization methods minimize squared errors for each variable independently and do not necessarily retain joint patterns. To address this issue, we present a joint grid discretization algorithm that preserves clusters in the original data. We evaluated this algorithm on simulated data to show its advantage over other methods in maintaining clusters, as measured by the adjusted Rand index. We also show that it promotes global functional patterns over independent patterns. On single-cell proteome and transcriptome data from leukemia and healthy blood, joint grid discretization captured known protein-to-RNA regulatory relationships while revealing previously unknown interactions. As such, joint grid discretization is applicable as a data transformation step in associative, functional, and causal inference of molecular interactions fundamental to systems biology. The developed software is publicly available at https://cran.r-project.org/package=GridOnClusters.
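The abstract scores a discretization by how well its grid cells agree with the original clusters, using the adjusted Rand index. The sketch below illustrates only that evaluation step in plain Python; it is not the GridOnClusters algorithm. The helper names (`adjusted_rand_index`, `grid_cells`), the toy points, and the hand-picked cut points are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:  # fewer than two points: nothing to compare
        return 1.0
    # Count pairs of points that share a label in both, in A only, in B only.
    same_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_in_a * same_in_b / total_pairs
    maximum = (same_in_a + same_in_b) / 2
    if maximum == expected:  # degenerate case, e.g. every point in one group
        return 1.0
    return (same_in_both - expected) / (maximum - expected)


def grid_cells(points, cuts_per_dim):
    """Map each point to its grid cell, given sorted cut points per dimension."""
    return [
        tuple(bisect_right(cuts, value) for value, cuts in zip(point, cuts_per_dim))
        for point in points
    ]


# Toy data (made up): two well-separated clusters in two dimensions.
points = [(0.9, 1.0), (1.0, 1.2), (1.1, 0.8), (3.0, 3.1), (2.9, 3.0), (3.2, 2.8)]
clusters = [0, 0, 0, 1, 1, 1]

# A grid whose lines fall between the clusters keeps them intact ...
ari_good = adjusted_rand_index(clusters, grid_cells(points, [[2.0], [2.0]]))
# ... while lines that cut through the clusters split and mix them.
ari_bad = adjusted_rand_index(clusters, grid_cells(points, [[1.0], [3.0]]))

print(ari_good)  # → 1.0
print(ari_bad)   # noticeably lower than 1.0
```

An index of 1 means the grid cells reproduce the clusters exactly, and values near 0 mean agreement no better than chance. This is why the index is a natural yardstick for whether a discretization preserves joint structure.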



Citations: 1

Prediction of Large for Gestational Age Infants in Overweight and Obese Women at Approximately 20 Gestational Weeks

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414906

Yuhan Du, J. Mehegan, F. Mcauliffe, C. Mooney

Large for gestational age (LGA) births are associated with many maternal and perinatal complications. Because overweight and obesity are risk factors for LGA, we aimed to predict LGA in overweight and obese women at approximately 20 gestational weeks, so that women at risk can be identified early enough to allow for appropriate interventions. A random forest algorithm was applied to maternal characteristics and blood biomarkers at baseline, together with ultrasound scan findings at 20 gestational weeks, to develop a prediction model. Here we present preliminary results that demonstrate the model's potential for use in clinical decision support to identify, early in pregnancy, patients at risk of an LGA birth.
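The ensemble idea behind a random forest can be sketched in a few lines of plain Python: each tree is trained on a bootstrap resample using a random subset of features, and predictions are made by majority vote. The version below is a deliberate simplification that uses one-split trees (decision stumps), whereas a real random forest grows deeper trees. The three numeric columns are unnamed placeholders and the values are invented; they are not clinical data and do not come from the study.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_indices):
    """Find the single (feature, threshold) split with the fewest errors."""
    best = None
    for f in feature_indices:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must send data to both sides
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_indices[0], float("inf"), majority, majority)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train stumps on bootstrap resamples with random feature subsets."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        features = rng.sample(range(len(rows[0])), n_features)
        forest.append(
            fit_stump([rows[i] for i in sample], [labels[i] for i in sample], features)
        )
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(
        left if row[f] <= threshold else right
        for f, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Invented, cleanly separable toy data: label 1 stands for "at risk".
low_risk = [[24.0 + 0.2 * i, 4.2 + 0.05 * i, 150.0 + i] for i in range(10)]
high_risk = [[31.0 + 0.2 * i, 5.4 + 0.05 * i, 172.0 + i] for i in range(10)]
forest = fit_forest(low_risk + high_risk, [0] * 10 + [1] * 10)

print(predict(forest, [23.0, 4.0, 148.0]))
print(predict(forest, [34.0, 6.0, 185.0]))
```

Resampling rows and features makes the individual trees disagree in different ways, so the vote is typically more stable than any single tree. Real clinical data would also need held-out evaluation, which this toy omits.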


Citations: 2

Representing Cellular Lines with SVM and Text Processing

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414912

I. Carrera, I. Dutra, E. Tejera

A main problem in predicting cell line interactions with chemical compounds is the lack of a computational representation for cell lines. We describe a method for characterizing cell lines from the scientific literature. We retrieve and process cell line-related scientific papers, apply a document classification algorithm, and then obtain a description of the information space of each cell line. We have successfully characterized a set of more than 300 cell lines.
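One simple way to turn literature into a numeric description of a cell line is a TF-IDF term vector built from the papers that mention it. The sketch below shows only that representation step, as an assumption about what such a pipeline could look like. It does not implement the document classification or the SVM named in the title, and the cell line names and snippets are made-up placeholders.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs_by_line):
    """Build one sparse TF-IDF vector (a dict) per cell line."""
    counts = {
        name: Counter(tok for text in texts for tok in tokenize(text))
        for name, texts in docs_by_line.items()
    }
    n_lines = len(counts)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())  # number of lines using each term
    vectors = {}
    for name, term_counts in counts.items():
        total = sum(term_counts.values())
        vectors[name] = {
            # Smoothed inverse document frequency keeps every weight positive.
            term: (c / total) * (math.log((1 + n_lines) / (1 + doc_freq[term])) + 1)
            for term, c in term_counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Placeholder corpus: names and texts are invented for illustration.
docs = {
    "line_A": [
        "breast carcinoma cells respond to estrogen",
        "estrogen receptor expression in breast cells",
    ],
    "line_B": ["breast tumor cells express estrogen receptor"],
    "line_C": ["lung epithelial cells exposed to smoke"],
}
vectors = tfidf_vectors(docs)

sim_ab = cosine(vectors["line_A"], vectors["line_B"])
sim_ac = cosine(vectors["line_A"], vectors["line_C"])
print(sim_ab > sim_ac)  # → True
```

Vectors of this kind can serve as input features for a downstream classifier or for comparing cell lines by the vocabulary of the papers that describe them.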


Citations: 0