Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics: Latest Publications (Page 7)

Machine Learning Model to Track SARS-CoV-2 Viral Mutation Evolution and Speciation Using Next-generation Sequencing Data

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3415991

I. Derecichei, G. Atikukke

RNA sequence analysis of emerging SARS-CoV-2 infection is valuable for tracking viral evolution and developing novel diagnostic tools. Furthermore, SARS-CoV-2 sequence analysis can provide insight into potential antigenic drift events that lead to strain speciation and changing clinical outcomes. In this work, we aim to develop a pipeline using next-generation sequencing (NGS) technology in addition to machine learning/bioinformatics to track the accumulation of mutations and viral evolution.

Citations: 2

GANDALF: Peptide Generation for Drug Design using Sequential and Structural Generative Adversarial Networks

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412487

Allison M. Rossetto, Wenjin Zhou

Computational drug design has the potential to save time and money by providing a better starting point for new drugs with a complete computational evaluation. We propose a peptide design system for protein targets based on a Generative Adversarial Network (GAN) called GANDALF (Generative Adversarial Network Drug-tArget Ligand Fructifier). GAN-based methods have been developed for computational drug design, but these can only generate small molecules, not peptides. Peptides are very complex macromolecules, which makes them much more difficult to generate than small molecules. Our GANDALF methodology uses two networks to generate a new peptide sequence and structure. It also incorporates data, such as active atoms, that are not used in other methods. Active atoms are important because they interact via electron sharing when a target protein and a peptide bind to each other. We can identify the active atoms using our electron structure calculation (eCADD) program and the rules of interaction we have developed. Our method goes further than comparable methods by generating a full peptide structure as well as predicting binding affinity. The results were validated using a multi-step process comparing the results with FDA-approved drugs and our initial prototype method. We have generated multiple peptides for three targets of interest (PD-1, PD-L1, and CTLA-4) and have found that the best generated peptide for each target was comparable to the FDA-approved drugs in binding affinity and fitness of 3D binding; we also show that the generated peptides are distinct from the existing FDA-approved drugs.

Citations: 4

Fusion Lasso and Its Applications to Cancer Subtype and Stage Prediction

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412461

Zhong Chen, Andrea Edwards, Kun Zhang

Effectively integrating and mining multi-view, high-dimensional omics data is instrumental to precision medicine. Numerous methods have been proposed for addressing this problem. However, they more or less neglect the challenges (e.g., interpretability, stability and consistency) pertaining to this integration process, and thereby suffer from unstable or inconsistent variable selection and deteriorating prediction accuracy. In this paper, we introduce a novel Fusion Lasso (FL) framework in which variable selection and data integration are formulated as a weighted constrained optimization problem. Specifically, four regularization constraints, i.e., sparsity, fusion penalty, instability and inconsistency, are simultaneously taken into account in the fusion model using multi-view data, while sparse features are revealed from the data of each individual view through ℓ1-norm minimization. We use the ADMM and Accelerated ADMM (AADMM) schemes to solve this optimization problem, leading to scalable model convergence with a solid theoretical guarantee. By applying FL to five multi-omics cancer datasets collected by The Cancer Genome Atlas (TCGA), we demonstrate that FL outperforms popular variable selection and data integration approaches, such as Elastic Net, Precision Lasso, B-RAIL and MDBN, in cancer subtype and/or stage prediction. The proposed method is useful and can be further adapted to systems biology and other advanced clinical research areas where multi-view data integration is a necessity.
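
Inside ADMM, an ℓ1-norm term is usually handled by an elementwise soft-thresholding step. The sketch below shows only that standard proximal operator, as an illustration of how ℓ1 minimization produces sparse coefficients; it is not the authors' Fusion Lasso solver, and the function name, the coefficient values and the threshold `kappa` are assumptions made for the example.

```python
def soft_threshold(values, kappa):
    """Proximal operator of kappa * ||x||_1, applied elementwise.

    Entries with magnitude <= kappa become exactly zero (sparsity);
    larger entries are shrunk toward zero by kappa.
    """
    result = []
    for v in values:
        if v > kappa:
            result.append(v - kappa)
        elif v < -kappa:
            result.append(v + kappa)
        else:
            result.append(0.0)
    return result


# Illustrative coefficients only; not taken from the paper.
coefficients = [2.5, -0.3, 0.8, -1.7]
shrunk = soft_threshold(coefficients, kappa=1.0)
print(shrunk)
```

Small coefficients are set to zero while large ones survive in shrunken form, which is the mechanism by which an ℓ1 penalty selects variables.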

Citations: 1

Impactful Mutations in Mpro of the SARS-CoV-2 Proteome

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414706

G. Wolfe, Othmane Belhoussine, Anais Dawson, Maxwell Lisaius, F. Jagodzinski

We explore how amino acid mutations affect the stability of the 306-residue main protease (Mpro) of the SARS-CoV-2 proteome. We employ two computational approaches, Site Directed Mutagenesis (SDM) and short runs of Molecular Dynamics. We focus our attention on residues 25-32, which make up a beta sheet of a canonical beta barrel close to an active site that includes Histidine 41. We considered this region a good candidate for mutations because a large perturbation of a highly structured region close to the active site may prove to be highly detrimental to the protein's stability and may affect catalytic efficiency. Understanding how amino acid mutations affect the stability of the protein can inform efforts to develop pharmacological interventions. We mutated the 8 residues in silico to all other possible amino acids, and analyzed the resulting 152 mutants (8 positions × 19 substitutions each). Both computational methods predict that only a few specific mutations to some of the 8 residues have a major effect on the structural stability of the protein.
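
The count of 152 mutants follows from substituting each of the 8 residues with the 19 other standard amino acids. A minimal sketch of that enumeration is given below; the wild-type letters are placeholders chosen for the example and are not the actual Mpro residues at positions 25-32, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids


def enumerate_point_mutants(wild_type):
    """Return labels such as 'A25C' for every single-residue substitution.

    wild_type maps a residue position to its one-letter wild-type amino acid.
    """
    mutants = []
    for position, original in sorted(wild_type.items()):
        for substitute in AMINO_ACIDS:
            if substitute != original:
                mutants.append(f"{original}{position}{substitute}")
    return mutants


# Placeholder wild-type letters for positions 25-32 (assumption, not the real sequence).
placeholder_wild_type = dict(zip(range(25, 33), "ACDEFGHI"))
mutants = enumerate_point_mutants(placeholder_wild_type)
print(len(mutants))
```

Each label would then be passed to a stability predictor or a short Molecular Dynamics run; that step is outside the scope of this sketch.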

Citations: 3

A Novel Pupillometric-Based Application for the Automated Detection of ADHD Using Machine Learning

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412427

William Das, S. Khanna

Attention-deficit/hyperactivity disorder (ADHD) is the most pervasive neurodevelopmental disorder among children and adolescents. Current clinical diagnosis, however, is inaccurate and inefficient, hindering the administration of proper treatment regimens. Clinical assessments are based on qualitative observations of perceived behavior. They are time-consuming and costly, preventing individuals from gaining the support they need to succeed academically, socially, and occupationally. A more accurate and efficient method of detection is necessary to ensure that all children can be diagnosed and given proper treatment regimens. This research proposes a novel machine learning-based method to analyze pupil-dynamics data as an objective biomarker to characterize ADHD. After visualizing and engineering pupillometric features, an evaluation of state-of-the-art machine learning algorithms showed that an Ensemble Voting Classifier yielded the optimal binary classification metrics under leave-one-out cross-validation (LOOCV). The model classified ADHD with 82.1% sensitivity, 72.7% specificity, and 85.6% AUROC. Moreover, novel insights into associations between pupillometric features and the presence of ADHD were garnered and statistically validated. The optimal machine learning model was implemented in a web application that administers a memory task and captures pupil biometrics in real time to output a probabilistic risk score of a patient having ADHD. This application is the first to use pupil-size dynamics as a biomarker, and offers a time-efficient and accurate approach to detecting ADHD in children.
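
For readers unfamiliar with the reported metrics, sensitivity is the fraction of true ADHD cases that the classifier flags, and specificity is the fraction of non-ADHD cases it correctly clears. The sketch below computes both from binary labels; the labels are toy values invented for the example and do not reproduce the study's data or its 82.1% and 72.7% figures.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels where 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (assumption): 10 positive and 10 negative subjects.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 8 + [0] * 2 + [0] * 7 + [1] * 3

sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(sensitivity, specificity)
```

Under leave-one-out cross-validation, each entry of `y_pred` would come from a model trained on all other subjects, so the two lists are filled one held-out subject at a time.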

Citations: 2

Three Co-expression Pattern Types across Microbial Transcriptional Networks of Plankton in Two Oceanic Waters

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412485

Ruby Sharma, Xuye Luo, Sajal Kumar, Mingzhou Song

Patterns of two molecules across biological systems are often labeled as conserved or differential. We argue that this classification is insufficient. Here, we introduce three types of relationships across systems. Upon stimuli, a type-0 pattern arises from conserved circuitry with active conserved trajectory; a type-1 pattern is conserved circuitry with active differential trajectory; a type-2 pattern is rewired circuitry with active trajectory. We present a 1st-order marginal change test, prove its optimality, and establish its asymptotic chi-squared distribution under the null hypothesis of identical marginals across conditions. The test outperformed other methods in detecting 1st-order differences in simulation studies. We also introduce a zeroth-order strength test to assess the association of two variables across systems. We compared gene co-expression networks of planktonic microbial communities in cold California coastal water against those in the warm water of the North Pacific Subtropical Gyre. The frequency of type-1 patterns is much higher than those of type-2 and type-0 patterns, revealing that the microbial communities are mostly conserved in molecular circuitry but responded differentially to ocean habitats. Type-1 and type-2 patterns are enriched with genes known to respond to environmental changes or stress; type-0 patterns involve genes with essential functions such as photosynthesis and general transcription. Our work provides a deep understanding of the effects of the environment on gene regulation in microbial communities. The method is generally applicable to other biological systems. All tests are provided in the R package 'DiffXTables' at https://cran.r-project.org/package=DiffXTables. Other source code and lists of significant gene patterns are available at https://www.cs.nmsu.edu/~joemsong/ACM-BCB-2020/Plankton

Citations: 1

Population-scale Genomic Data Augmentation Based on Conditional Generative Adversarial Networks

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3412475

Junjie Chen, M. Mowlaei, Xinghua Shi

Although next-generation sequencing technologies have made it possible to quickly generate a large collection of sequences, current genomic data still suffer from small data sizes, imbalances, and biases due to various factors including disease rareness, test affordability, and concerns about privacy and security. In order to address these limitations of genomic data, we develop a Population-scale Genomic Data Augmentation method based on Conditional Generative Adversarial Networks (PG-cGAN) to enhance the amount and diversity of genomic data by transforming samples already in the data rather than collecting new samples. Both the generator and the discriminator in PG-cGAN are stacked with convolutional layers to capture the underlying population structure. Our results for augmenting genotypes in human leukocyte antigen (HLA) regions showed that PG-cGAN can generate new genotypes with similar population structure, variant frequency distributions and LD patterns. Since the input for PG-cGAN is the original genomic data without assumptions about prior knowledge, it can be extended to enrich many other types of biomedical data and beyond.
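
One of the checks mentioned above, similarity of variant frequency distributions, can be illustrated with a simple per-variant alternate-allele frequency comparison. This is a sketch under assumptions: genotypes are coded as 0/1/2 alternate-allele counts, the two small matrices are made up for the example, and the function name is hypothetical. It is not the evaluation code used for the cGAN described above.

```python
def alt_allele_frequencies(genotypes):
    """Per-variant alternate-allele frequency for diploid genotypes.

    genotypes: list of samples, each a list of 0/1/2 alternate-allele counts.
    """
    n_samples = len(genotypes)
    n_variants = len(genotypes[0])
    return [
        sum(sample[j] for sample in genotypes) / (2 * n_samples)
        for j in range(n_variants)
    ]


# Made-up genotype matrices (4 samples x 3 variants); not data from the paper.
real = [[0, 1, 2], [1, 1, 0], [0, 2, 1], [1, 0, 1]]
generated = [[0, 1, 1], [1, 1, 1], [0, 1, 1], [1, 1, 1]]

real_freq = alt_allele_frequencies(real)
generated_freq = alt_allele_frequencies(generated)
max_gap = max(abs(r - g) for r, g in zip(real_freq, generated_freq))
print(real_freq, generated_freq, max_gap)
```

Matching allele frequencies alone does not establish that population structure or LD patterns are preserved; those would need separate checks.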

Citations: 4

Asymptotically Stable Privacy Protection Technique for fMRI Shared Data over Distributed Computer Networks

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414863

Naseeb Thapaliya, Lavanya Goluguri, S. Suthaharan

This paper presents a computational technique that leverages the asymptotic-stabilization behavior of transition probabilities that are characterized by a two-state Markov chain. These asymptotic probabilities help the computational technique to protect the privacy of functional magnetic resonance imaging (fMRI) data that is shared over a public distributed computer network. In general, fMRI signals reveal a large number of correlated brain features that can be utilized in the development of predictive models for extracting brain networks and inferring private information about an individual. These features make fMRI data highly vulnerable to privacy attacks. To conceal these features for privacy protection, we transform them to an asymptotic state of an fMRI signal using the concepts of asymptotic stabilization with a two-state Markov chain, together with compressed sensing and compressed learning techniques. The proposed predictive model is built using the asymptotically stabilized fMRI signals, rather than the original signals, which enhances the protection of privacy. Hence, the transformed signal, instead of the original signal, may be shared in public computer networks, such as a cloud computing network. The computer simulations show that the proposed predictive model provides very high prediction accuracy, while providing very strong privacy protection.
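
The asymptotic stabilization of a two-state Markov chain can be seen numerically: whatever the starting distribution, repeated transitions converge to the same limiting probabilities. The sketch below illustrates only this textbook property; the transition probabilities `a` and `b` are arbitrary example values, and the code does not implement the paper's fMRI transformation.

```python
def step(dist, a, b):
    """One transition of a two-state chain with P(0 -> 1) = a and P(1 -> 0) = b."""
    p0, p1 = dist
    return (p0 * (1 - a) + p1 * b, p0 * a + p1 * (1 - b))


def asymptotic_distribution(a, b, start=(1.0, 0.0), steps=200):
    """Iterate the chain from `start` until it has effectively stabilized."""
    dist = start
    for _ in range(steps):
        dist = step(dist, a, b)
    return dist


a, b = 0.3, 0.1  # example transition probabilities (assumption)
limit = asymptotic_distribution(a, b)
limit_other_start = asymptotic_distribution(a, b, start=(0.0, 1.0))
closed_form = (b / (a + b), a / (a + b))  # stationary distribution of the chain
print(limit, limit_other_start, closed_form)
```

Because the limit does not depend on the starting state, the stabilized values carry no trace of where the chain began, which is the intuition behind using the asymptotic state to conceal the original signal's features.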

Citations: 2

Processing Millions of Single Cells by SHARP

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3414214

Shibiao Wan, Junil Kim, Yiping Fan, Kyoung-Jae Won

Single-cell technologies have received extensive attention from the bioinformatics and computational biology communities due to their impact on uncovering novel cell types and intra-population heterogeneity in various domains of biology and medicine. Recent advances in single-cell RNA-sequencing (scRNA-seq) technologies have enabled parallel transcriptomic profiling of millions of cells. However, existing scRNA-seq clustering methods lack scalability, are time-consuming, and are prone to information loss during dimension reduction. To address these concerns, we present SHARP [1], an ensemble random projection-based algorithm that scales to clustering 10 million cells. By adopting a divide-and-conquer strategy, a sparse random projection and two-layer meta-clustering, SHARP has the following advantages: (1) it is much faster than existing algorithms; (2) it scales to 10 million cells; (3) it is accurate in terms of clustering performance; (4) it preserves cell-to-cell distances during dimension reduction; and (5) it is robust to dropouts in scRNA-seq data. Comprehensive benchmarking tests on 20 scRNA-seq datasets demonstrate that SHARP remarkably outperforms state-of-the-art methods in terms of speed and accuracy. To the best of our knowledge, SHARP is the only R-based tool that scales to clustering 10 million cells. With an avalanche of single cells in different tissues to be sequenced in multiple international projects like The Human Cell Atlas, we believe SHARP will serve as one of the useful and important tools for large-scale single-cell data analysis. Potential future directions include extending SHARP, while keeping its scalability and speed, to rare cell type detection and to the integration of single-cell data from different platforms, experiments and conditions.
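
The sparse random projection step can be sketched as follows. This is a generic sparse projection of the kind commonly used for distance-preserving dimension reduction, written in plain Python for illustration; the density, dimensions, seed and function names are assumptions, and the code does not reproduce SHARP's R implementation, its ensemble step or its meta-clustering.

```python
import math
import random


def sparse_projection_matrix(n_features, n_components, density, rng):
    """Build a sparse random matrix of shape n_features x n_components.

    Each entry is 0 with probability 1 - density, otherwise +scale or -scale
    with equal probability, where scale = sqrt(1 / (density * n_components)).
    With this scaling, squared vector norms are preserved in expectation.
    """
    scale = math.sqrt(1.0 / (density * n_components))
    matrix = []
    for _ in range(n_features):
        row = []
        for _ in range(n_components):
            u = rng.random()
            if u < density / 2:
                row.append(scale)
            elif u < density:
                row.append(-scale)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix


def project(cell, matrix):
    """Project one expression vector into the lower-dimensional space."""
    out = [0.0] * len(matrix[0])
    for value, row in zip(cell, matrix):
        if value != 0.0:  # skip zeros, which are frequent in scRNA-seq counts
            for j, weight in enumerate(row):
                out[j] += value * weight
    return out


rng = random.Random(0)
n_genes, n_dims = 50, 10  # toy sizes (assumption)
# Toy expression matrix: 5 cells with random values, not real scRNA-seq data.
cells = [[rng.random() for _ in range(n_genes)] for _ in range(5)]
matrix = sparse_projection_matrix(n_genes, n_dims, density=0.2, rng=rng)
projected = [project(cell, matrix) for cell in cells]
print(len(projected), len(projected[0]))
```

Because most matrix entries are zero, the projection is cheap to compute, and each block of cells in a divide-and-conquer scheme can be projected independently before clustering.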

Citations: 0

Fusion Transcript Detection from RNA-Seq using Jaccard Distance

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics

Pub Date : 2020-09-21 DOI: 10.1145/3388440.3415585

Hamidreza Mohebbi, Nurit Haspel, D. Simovici, Joyce Quach

Gene fusion events are common in prostate, lymphoid, soft tissue, breast, gastric, and lung cancers, so fast and accurate fusion detection methods are needed. However, accurate identification requires whole genome sequencing. Current state-of-the-art methods suffer from inefficiency, insufficient accuracy, and high false-positive rates. In this research we present a parallel method that converts an inefficient categorical space into a compact binary array, thereby reducing the dimensionality of the data and speeding up the computation. The FDJD pipeline contains three steps: general alignment, fusion candidate generation, and refinement. Jaccard distance is used, alongside a fast KNN implementation, to find the nearest neighbors of a given query binary fingerprint. We benchmarked fusion prediction accuracy on both simulated and genuine RNA-Seq data sets and compared the results with the state-of-the-art methods STAR-Fusion, InFusion and TopHat-Fusion. The genuine paired-end Illumina RNA-Seq data were obtained from 60 publicly available cancer cell line data sets. FDJD outperformed these popular alternative fusion detection methods on both simulated and genuine data sets, attaining 90% accuracy on simulated fusion transcript inputs. Of a total of 86 fusions predicted by at least three methods, we found 44 experimentally validated fusions using a wisdom-of-crowds approach. FDJD is not the fastest of the studied methods, but it achieved the highest accuracy.
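The abstract does not specify how the binary fingerprints are built, so the Python sketch below only illustrates the general idea of Jaccard-based nearest-neighbor search over binary arrays. It is not the FDJD implementation. The k-mer hashing scheme, the fingerprint length `n_bits` and all function names are assumptions for this example, and the brute-force search stands in for whatever fast KNN method the authors use. For two binary fingerprints A and B, the Jaccard distance is 1 - |A AND B| / |A OR B|.

```python
import zlib

import numpy as np


def kmer_fingerprint(seq, k=4, n_bits=256):
    """Hash each k-mer of seq to one bit of a fixed-length boolean array.

    Hypothetical encoding: zlib.crc32 is used because it is deterministic
    across runs, unlike Python's built-in hash() for strings.
    """
    fp = np.zeros(n_bits, dtype=bool)
    for i in range(len(seq) - k + 1):
        fp[zlib.crc32(seq[i:i + k].encode()) % n_bits] = True
    return fp


def jaccard_distance(a, b):
    """Jaccard distance between two boolean arrays of equal length."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return float(1.0 - np.logical_and(a, b).sum() / union)


def nearest_neighbors(query, references, n_neighbors=1):
    """Brute-force KNN: return (index, distance) pairs, closest first."""
    dists = np.array([jaccard_distance(query, r) for r in references])
    order = np.argsort(dists, kind="stable")[:n_neighbors]
    return [(int(i), float(dists[i])) for i in order]
```

For example, with references `[1,0,0,0]`, `[1,1,0,0]` and `[0,0,1,1]` and query `[1,1,0,0]`, the distances are 0.5, 0.0 and 1.0, so the second reference is returned first.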

Citations: 1