Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics: Latest Articles

RA2Vec
Rajitha Yasas Wijesekara, Ashwin Lahorkar, Kunal Rathore, J. Valadi
Protein function identification has become an important task because of the large number of new genomes being sequenced. Distributed representations [1] of words as continuous vectors have recently been found to be a very efficient way to capture semantic and syntactic information. In such a representation, each word is embedded in an n-dimensional space, and similar words have nearby vectors. In the popular skip-gram configuration, the model uses the current word to predict its surrounding words. In this work we introduce a distributed representation for protein sequences based on reduced amino acid alphabets. In our RA2Vec (Reduced Alphabets to Vectors) implementation, we first map all Swiss-Prot sequences to reduced forms based on hydropathy and on conformational similarity. We then use a skip-gram-based method to create reduced-alphabet embedding vectors (RA2Vec) for each set. Embedding vectors for sequences in the original ProtVec representation [2] were also created. These vectors were created for various combinations of k-gram lengths and vector sizes. All seven non-empty combinations of the three vector sets (original ProtVec, hydropathy-based, and conformational-similarity-based) were then used as input to Support Vector Machine classifiers to build classification models. The embedding vectors were further reduced using Recursive Feature Elimination (RFE) to maximize fivefold cross-validation accuracy. We assessed the validity and utility of the new representations on five different data sets. Our results on all data sets indicate that certain synergistic combinations of the new representations, with and without the ProtVec embedding, can significantly improve performance.
DOI: https://doi.org/10.1145/3388440.3414925 (published 2020-09-21)
Citations: 1
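As a rough illustration of the first two steps described in the RA2Vec abstract above (mapping sequences to a reduced alphabet, then splitting them into k-grams), here is a minimal Python sketch. The grouping table `HYPOTHETICAL_GROUPS`, the function names, and the offset-based splitting are assumptions made for illustration only; the abstract does not specify the actual alphabets or the tokenization.

```python
# Illustrative only: this grouping is a hypothetical example and is NOT
# the hydropathy or conformational-similarity alphabet used in the paper.
HYPOTHETICAL_GROUPS = {
    "A": "AVLIMC",  # hydrophobic
    "B": "FWYH",    # aromatic
    "C": "STNQ",    # polar
    "D": "KR",      # positively charged
    "E": "DE",      # negatively charged
    "F": "GP",      # glycine / proline
}

# Invert the table so each residue letter points to its group symbol.
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in HYPOTHETICAL_GROUPS.items()
    for residue in members
}


def reduce_sequence(seq):
    """Map each residue to its group symbol; unknown residues become 'X'."""
    return "".join(RESIDUE_TO_GROUP.get(residue, "X") for residue in seq.upper())


def kgram_sentences(seq, k=3):
    """Return k token lists, one per starting offset, each holding
    non-overlapping k-grams (an assumed tokenization scheme)."""
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


reduced = reduce_sequence("MKTAYIAKQR")
sentences = kgram_sentences(reduced, k=3)
print(reduced)
for tokens in sentences:
    print(tokens)
```

Token lists of this kind are the sort of input a skip-gram trainer expects; the choice of trainer and its settings is left open here.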
Integrative Deep Learning for PanCancer Molecular Subtype Classification Using Histopathological Images and RNAseq Data
Fatima Zare, J. Noorbakhsh, Tianyu Wang, Jeffrey H. Chuang, S. Nabavi
Deep learning has recently become a key methodology for the study and interpretation of cancer histology images. The ability of convolutional neural networks (CNNs) to learn features automatically from raw data without pathologist expert knowledge, together with the availability of annotated histopathology datasets, has contributed to growing interest in deep learning applications in histopathology. In clinical practice for cancer, histopathological images are commonly used for diagnosis, prognosis, and treatment. Recently, molecular subtype classification has gained significant attention for predicting the outcomes of standard chemotherapy and for creating personalized targeted cancer therapy. Genomic profiles, especially gene expression data, are the data most often used for molecular subtyping. In this study, we developed a novel PanCancer CNN model, based on transfer learning from Google Inception V3, to classify molecular subtypes using histopathological images. We used 22,484 Hematoxylin and Eosin (H&E) slides from 32 cancer types provided by The Cancer Genome Atlas (TCGA) to train and evaluate the model. We showed that, by employing deep learning, H&E slides can be used to classify molecular subtypes of solid tumor samples with high areas under the curve (AUCs) (micro-average = 0.90; macro-average = 0.90). In cancer studies, combining histopathological images with genomic data has rarely been explored. We investigated the relationship between features extracted from H&E images and features extracted from gene expression profiles. We observed that the features from these two modalities (H&E images and gene expression values) for molecular subtyping are highly correlated. We therefore developed an integrative deep learning model that combines histological images and gene expression profiles. We showed that the integrative model improves the overall performance of molecular subtype classification (AUCs: micro-average = 0.99; macro-average = 0.97). These results show that integrating H&E images and gene expression profiles can enhance the accuracy of molecular subtype classification.
DOI: https://doi.org/10.1145/3388440.3412414 (published 2020-09-21)
Citations: 1
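The PanCancer abstract above reports both micro-average and macro-average AUCs. The sketch below shows, on made-up toy scores, how the two averages can differ in a one-vs-rest multi-class setting: the macro-average is the mean of per-class AUCs, while the micro-average pools every (score, label) pair before computing a single AUC. The function names and the toy data are assumptions for illustration and have no connection to the paper's model or results.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def micro_macro_auc(score_rows, true_classes, n_classes):
    """Return (micro, macro) one-vs-rest AUC for per-class score rows."""
    per_class = []
    pooled_scores, pooled_labels = [], []
    for c in range(n_classes):
        scores = [row[c] for row in score_rows]
        labels = [1 if t == c else 0 for t in true_classes]
        per_class.append(auc(scores, labels))
        # Micro-averaging pools all one-vs-rest decisions together.
        pooled_scores.extend(scores)
        pooled_labels.extend(labels)
    macro = sum(per_class) / n_classes
    micro = auc(pooled_scores, pooled_labels)
    return micro, macro


# Toy predicted class scores for four samples and three classes.
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.2, 0.3, 0.5],
    [0.4, 0.1, 0.5],
]
toy_truth = [0, 1, 2, 0]

micro, macro = micro_macro_auc(toy_scores, toy_truth, n_classes=3)
print(f"micro-average AUC: {micro:.4f}")
print(f"macro-average AUC: {macro:.4f}")
```

Because the micro-average weights every decision equally, classes with many samples dominate it, whereas the macro-average gives each class the same weight regardless of size.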
Modularity Analysis of Bipartite Networks and Multivariate ANOVA for Identification of Differentially Expressed Proteins in a Mouse Model of Down Syndrome
A. Jazayeri, Sara Pajouhanfar, Sadaf Saba, Christopher C. Yang
Down Syndrome (DS) is one of the most common disorders and is caused by the presence of an extra copy of chromosome 21. It has been shown that the expression of various genes located on chromosomes other than chromosome 21 is also affected in DS. Given the practical and ethical difficulties of human tissue studies, the Ts65Dn mouse model has been widely used in DS research. In this study, we propose a pipeline composed of a supervised learning approach, modularity analysis of a bipartite network, and multivariate analysis of variance (MANOVA) for identifying differentially expressed proteins (DEP) among different classes of mouse models. The proposed pipeline is tested using the expression levels of 77 proteins in eight different classes of mouse models. The data include protein expression measurements for 34 trisomic Ts65Dn mice and 38 control mice. Each group is divided into four classes according to whether the mice were stimulated to learn and whether they were injected with memantine or saline. Previously proposed approaches have been unable to identify DEP among all eight classes simultaneously. Here, we show that our proposed pipeline can successfully identify the set of proteins expressed differently among all eight classes. The findings of this study can inform the study of learning responses to different treatments and of protein-treatment associations in DS. The proposed pipeline can also be adopted to identify DEP in DS or in other diseases and health conditions, which can in turn inform the development of improved personalized treatment and management strategies.
DOI: https://doi.org/10.1145/3388440.3412421 (published 2020-09-21)
Citations: 0
Fusion Transcript Detection from RNA-Seq using Jaccard Distance
Hamidreza Mohebbi, Nurit Haspel, D. Simovici, Joyce Quach
Gene fusion events are quite common in prostate, lymphoid, soft tissue, breast, gastric, and lung cancers, which calls for fast and accurate fusion detection methods. However, accurate identification requires whole genome sequencing. Current state-of-the-art methods suffer from inefficiency, insufficient accuracy, and high false positive rates. In this research we present a parallel method that converts an inefficient categorical space into a compact binary array, thereby reducing the dimensionality of the data and speeding up the computation. The FDJD pipeline contains three steps: general alignment, fusion candidate generation, and refinement. In our research, Jaccard distance is used to measure similarity when finding the nearest neighbors of a given query binary fingerprint, alongside a fast KNN implementation. We benchmarked our fusion prediction accuracy using both simulated and genuine RNA-Seq data sets. Fusion detection results are compared with the state-of-the-art methods STAR-Fusion, InFusion, and TopHat-Fusion. The genuine paired-end Illumina RNA-Seq data were obtained from 60 publicly available cancer cell line data sets. FDJD showed superior performance compared to popular alternative fusion detection methods on both simulated and genuine data sets. It attained 90% accuracy on simulated fusion transcript inputs. Of a total of 86 fusions predicted by at least three methods, we found 44 experimentally validated fusions using a wisdom-of-crowds approach. FDJD is not the fastest among the studied methods; however, it achieved the highest accuracy.
DOI: https://doi.org/10.1145/3388440.3415585 (published 2020-09-21)
Citations: 1
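The fusion detection abstract above relies on Jaccard distance between binary fingerprints to find nearest neighbors. The following minimal Python sketch shows that idea with fingerprints stored as integers used as bitsets and a brute-force search. The fingerprint values, the names `read_a` to `read_c`, and the brute-force ranking are hypothetical; the abstract does not describe how FDJD encodes reads or how its fast KNN is implemented.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two binary fingerprints stored as ints.

    distance = 1 - |A and B| / |A or B|, where set bits are set members.
    Two empty fingerprints are treated as identical (distance 0.0).
    """
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    intersection = bin(a & b).count("1")
    return 1.0 - intersection / union


def nearest_neighbors(query, fingerprints, k=2):
    """Brute-force k nearest neighbors of `query` by Jaccard distance."""
    ranked = sorted(
        fingerprints.items(),
        key=lambda item: jaccard_distance(query, item[1]),
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical six-bit fingerprints, for illustration only.
toy_fingerprints = {
    "read_a": 0b101100,
    "read_b": 0b101000,
    "read_c": 0b010011,
}
toy_query = 0b101110

for name, fp in toy_fingerprints.items():
    print(name, round(jaccard_distance(toy_query, fp), 4))
print(nearest_neighbors(toy_query, toy_fingerprints, k=2))
```

Packing each fingerprint into a single integer keeps the intersection and union down to two bitwise operations, which is one reason a compact binary encoding can speed up this kind of comparison.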
Smart Computational Approaches with Advanced Feature Selection Algorithms for Optimizing the Classification of Mobility Data in Health Informatics
E. Rastegari, D. Orn, H. Ali
Recently, wearable mobility monitoring devices have gained a great deal of attention for collecting movement and gait-related data. Moreover, wearable movement monitoring devices combined with machine learning techniques have been shown to be successful in a variety of healthcare applications, including diagnosis, prognosis, and rehabilitation. However, advanced studies are needed to create accurate and robust models that can differentiate between populations based on their mobility signatures. This is particularly critical for monitoring the movement and gait patterns of individuals affected by neurodegenerative conditions such as Parkinson's Disease (PD). To achieve this goal, it is critical to employ a robust approach to model the available data and to identify the optimal set of movement parameters for the classification process. In this work, we propose a computational approach to identify the best feature selection method for spatiotemporal gait parameters. We investigate several feature selection approaches and analyze their performance on the mobility classification problem, including maximum information gain with minimum correlation (MIGMC), maximum signal-to-noise ratio with minimum correlation (MSNR&MC), genetic algorithms (GA), decision trees (DT), and principal component analysis (PCA). These methods, along with newly proposed variations, are assessed in terms of classification accuracy, number of selected features, and computation time. Data collected from triaxial accelerometers attached to the ankles of individuals with PD, geriatric individuals (GE), and healthy elderly individuals (HE) were used to train and test a set of six different machine learning techniques. Our results indicate that three of the six feature selection methods, namely GA, MSNR&MC, and a modified version of MIGMC, perform best in terms of classification accuracy. We also show that more robust performance is achieved when multiple algorithms, such as decision trees and genetic algorithms, are employed together. This study provides a critical first step towards the much-needed goal of using data collected from wearable devices to extract important information for the diagnosis and rehabilitation of many movement-related medical conditions.
DOI: https://doi.org/10.1145/3388440.3412426 (published 2020-09-21)
Citations: 2
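The mobility data abstract above names "maximum signal-to-noise ratio with minimum correlation" (MSNR&MC) as one of its feature selection methods but does not spell it out. The sketch below is one plausible reading, not the authors' algorithm: rank features by a two-class signal-to-noise ratio, then greedily keep a feature only if it is not strongly correlated with any feature already kept. The SNR formula, the `max_corr` threshold, the function names, and the toy gait values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def snr(values, labels):
    """Two-class signal-to-noise ratio: |mean1 - mean0| / (std1 + std0)."""
    g0 = [v for v, y in zip(values, labels) if y == 0]
    g1 = [v for v, y in zip(values, labels) if y == 1]
    spread = pstdev(g0) + pstdev(g1)
    return abs(mean(g1) - mean(g0)) / spread if spread else float("inf")


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / norm if norm else 0.0


def select_features(columns, labels, max_corr=0.9):
    """Greedy selection: highest SNR first, skipping any feature whose
    absolute correlation with an already selected feature exceeds max_corr."""
    ranked = sorted(columns, key=lambda name: snr(columns[name], labels), reverse=True)
    selected = []
    for name in ranked:
        if all(abs(pearson(columns[name], columns[s])) <= max_corr for s in selected):
            selected.append(name)
    return selected


# Made-up gait parameters for six subjects (label 1 = hypothetical patient group).
toy_labels = [0, 0, 0, 1, 1, 1]
toy_columns = {
    "stride_length": [1.0, 1.1, 0.9, 1.5, 1.6, 1.4],
    "step_length": [0.5, 0.56, 0.44, 0.74, 0.8, 0.7],  # nearly redundant with stride_length
    "cadence": [100, 104, 96, 101, 99, 103],
}

print(select_features(toy_columns, toy_labels))
```

In this toy data `step_length` separates the groups well but is dropped because it carries almost the same information as `stride_length`, which is the trade-off between relevance and redundancy that such methods aim for.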
Automated Classification of Acute Rejection from Endomyocardial Biopsies
F. Giuste, M. Venkatesan, Conan Y. Zhao, L. Tong, Yuanda Zhu, S. Deshpande, May D. Wang
Heart transplant rejection must be identified quickly and accurately to optimize anti-rejection therapies and prevent organ loss. Expert evaluation of endomyocardial biopsies is labor-intensive, prone to human bias, and suffers from low inter-rater agreement. Additionally, the growing use of digital pathology for biopsy examination has heightened the need for additional image quality control. To meet these challenges, we developed a novel transplant rejection detection pipeline that automatically identifies histology slides in need of rescanning and highlights biopsy regions showing potential signs of rejection. Our system leverages a fast and effective automated patch-level quality filter, as well as state-of-the-art feature extraction techniques, to provide quality whole-slide-level labeling of early rejection signs. We successfully identified digital pathology images with poor image quality and leveraged this quality gain to improve our novel weakly-supervised learning model, leading to a transplant rejection classification AUC of 70.12% (±20.74%).
DOI: https://doi.org/10.1145/3388440.3412430 (published 2020-09-21)
Citations: 5
Interpretable Molecule Generation via Disentanglement Learning
Yuanqi Du, Xiaojie Guo, Amarda Shehu, Liang Zhao
Designing molecules with specific structural and functional properties (e.g., drug-likeness and water solubility) is central to advancing drug discovery and material science, but it poses outstanding challenges in both wet and dry laboratories. The search space is vast and rugged. Recent advances in deep generative models are motivating new computational approaches, built on deep learning, to tackle the molecular space. Despite rapid advancements, state-of-the-art deep generative models for molecule generation have many limitations, including a lack of interpretability. In this paper we address this limitation by proposing a generic framework for interpretable molecule generation based on novel disentangled deep graph generative models with property control. Specifically, we propose a disentanglement enhancement strategy for graphs. We also propose a new deep neural architecture that achieves this learning objective efficiently, supporting inference and generation for variable-size graphs. Extensive experimental evaluation demonstrates the superiority of our approach in various critical aspects, such as accuracy, novelty, and disentanglement.
DOI: https://doi.org/10.1145/3388440.3414709 (published 2020-09-21)
Citations: 12
Joint Grid Discretization for Biological Pattern Discovery
Jiandong Wang, Sajal Kumar, Mingzhou Song
The complexity, dynamics, and scale of data acquired by modern biotechnology increasingly favor model-free computational methods that make minimal assumptions about the underlying biological mechanisms. For example, single-cell transcriptome and proteome data have a throughput several orders of magnitude higher than that of bulk methods. Many model-free statistical methods for pattern discovery, such as mutual information and chi-squared tests, however, require discrete data. Most discretization methods minimize squared errors for each variable independently and do not necessarily retain joint patterns. To address this issue, we present a joint grid discretization algorithm that preserves clusters in the original data. We evaluated this algorithm on simulated data to show its advantage over other methods in maintaining clusters, as measured by the adjusted Rand index. We also show that it promotes global functional patterns over independent patterns. On single-cell proteome and transcriptome data from leukemia and healthy blood, joint grid discretization captured known protein-to-RNA regulatory relationships while revealing previously unknown interactions. As such, joint grid discretization is applicable as a data transformation step in associative, functional, and causal inference of molecular interactions fundamental to systems biology. The developed software is publicly available at https://cran.r-project.org/package=GridOnClusters
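The evaluation idea in this abstract (do the grid cells still separate the original clusters, as scored by the adjusted Rand index?) can be illustrated with a small standard-library Python sketch. This is not the GridOnClusters algorithm: the cut points are supplied by hand, and the function names `grid_cell_labels` and `adjusted_rand_index` as well as the toy points are assumptions made only for illustration.

```python
from bisect import bisect_right
from collections import Counter
from math import comb


def grid_cell_labels(points, cuts):
    """Label each point by its grid cell, given sorted cut points per dimension."""
    return [tuple(bisect_right(c, x) for x, c in zip(p, cuts)) for p in points]


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # degenerate case: nothing to compare
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(v, 2) for v in Counter(labels_a).values())
    sum_b = sum(comb(v, 2) for v in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # both labelings are trivial in the same way
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: two well-separated clusters in two dimensions.
points = [(1, 1), (2, 1), (1, 2), (8, 9), (9, 8), (9, 9)]
clusters = [0, 0, 0, 1, 1, 1]

# Cuts placed between the clusters keep them apart in the grid ...
joint_cuts = [[5.0], [5.0]]
# ... while cuts that slice through the clusters mix them across cells.
poor_cuts = [[1.5], [8.5]]

ari_joint = adjusted_rand_index(clusters, grid_cell_labels(points, joint_cuts))
ari_poor = adjusted_rand_index(clusters, grid_cell_labels(points, poor_cuts))
print(ari_joint, ari_poor)
```

A grid whose cells line up with the clusters scores higher than one whose cuts pass through them, which is the property the abstract reports measuring.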
DOI: https://doi.org/10.1145/3388440.3412415 (published 2020-09-21)
Citations: 1
Prediction of Large for Gestational Age Infants in Overweight and Obese Women at Approximately 20 Gestational Weeks
Yuhan Du, J. Mehegan, F. Mcauliffe, C. Mooney
Large for gestational age (LGA) births are associated with many maternal and perinatal complications. Because overweight and obesity are risk factors for LGA, we aimed to predict LGA in overweight and obese women at approximately 20 gestational weeks, so that women at risk can be identified early enough to allow appropriate interventions. A random forest algorithm was applied to maternal characteristics and blood biomarkers at baseline, together with ultrasound scan findings at 20 gestational weeks, to develop a prediction model. Here we present preliminary results that demonstrate the model's potential for use in clinical decision support to identify, early in pregnancy, patients at risk of an LGA birth.
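For readers unfamiliar with the technique, the two ingredients of a random forest (bootstrap resampling and random feature selection, followed by majority voting) can be sketched with the Python standard library. This is a toy ensemble of depth-1 trees on synthetic numbers, not the authors' model or data: the function names, the two anonymous feature columns and every value below are assumptions for illustration, and a real study would use a full implementation with deeper trees.

```python
import random
from collections import Counter


def fit_stump(X, y, feature_ids):
    """Find the best rule 'predict 1 if row[f] > t' among the given features."""
    best = None  # (errors, feature index, threshold)
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        # Candidate thresholds: below all values, between neighbours, at the maximum.
        midpoints = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for t in [values[0] - 1.0] + midpoints + [values[-1]]:
            errors = sum((row[f] > t) != bool(label) for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]


def fit_forest(X, y, n_trees=25, max_features=1, seed=0):
    """Train stumps on bootstrap samples, each restricted to random features."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(d), max_features)   # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(int(row[f] > t) for f, t in forest)
    return votes.most_common(1)[0][0]


# Synthetic, well-separated toy data (not clinical measurements).
X = [[float(i), float(2 * i)] for i in range(1, 11)]
X += [[100.0 + i, 200.0 + 2 * i] for i in range(1, 11)]
y = [0] * 10 + [1] * 10

forest = fit_forest(X, y)
print(predict(forest, [3.0, 5.0]), predict(forest, [105.0, 215.0]))
```

An odd number of stumps avoids tied votes in the two-class case; the seed is fixed only so that repeated runs build the same ensemble.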
DOI: https://doi.org/10.1145/3388440.3414906 (published 2020-09-21)
Citations: 2
Representing Cellular Lines with SVM and Text Processing
I. Carrera, I. Dutra, E. Tejera
A major obstacle to predicting cell line interactions with chemical compounds is the lack of a computational representation for cell lines. We describe a method for characterizing cell lines from the scientific literature. We retrieve and process cell line-related scientific papers, apply a document classification algorithm, and then obtain a description of the information space of each cell line. We have successfully characterized a set of more than 300 cell lines.
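The abstract does not say which text features are used, so the following is only a sketch of one common text-processing step that often precedes a classifier such as an SVM: turning documents into TF-IDF weighted bag-of-words vectors and comparing them by cosine similarity. The smoothed IDF formula, the function names and the three short texts are assumptions for illustration and are not taken from the paper's corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens, keeping hyphenated names whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def tfidf_vectors(token_lists):
    """Return one sparse {term: weight} vector per document (smoothed IDF)."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Made-up snippets standing in for literature about two cell lines and a new paper.
texts = [
    "HeLa cervical cancer cells proliferation assay",
    "MCF-7 breast cancer cells estrogen receptor",
    "estrogen receptor signalling in breast cells",
]
hela_vec, mcf7_vec, query_vec = tfidf_vectors([tokenize(t) for t in texts])
print(cosine(query_vec, hela_vec), cosine(query_vec, mcf7_vec))
```

Vectors of this kind could then be fed to a document classifier; the query shares more weighted terms with the MCF-7 text than with the HeLa text, so its similarity to the former is higher.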
DOI: https://doi.org/10.1145/3388440.3414912 (published 2020-09-21)
Citations: 0