
Latest publications from ACM-BCB: the ACM Conference on Bioinformatics, Computational Biology and Biomedicine

Multi-Group Tensor Canonical Correlation Analysis.
Zhuoping Zhou, Boning Tong, Davoud Ataee Tarzanagh, Bojian Hou, Andrew J Saykin, Qi Long, Li Shen

Tensor Canonical Correlation Analysis (TCCA) is a statistical method commonly used to examine linear associations between two sets of tensor datasets. However, existing TCCA models fail to adequately address the heterogeneity present in real-world tensor data, such as brain imaging data collected from diverse groups characterized by factors like sex and race; consequently, these models may yield biased outcomes. To overcome this limitation, we propose a novel approach called Multi-Group TCCA (MG-TCCA), which enables the joint analysis of multiple subgroups. By incorporating a dual sparsity structure and a block coordinate ascent algorithm, our MG-TCCA method effectively addresses heterogeneity and leverages information across groups to identify consistent signals. This approach facilitates the quantification of shared and individual structures, reduces data dimensionality, and enables visual exploration. To validate the approach empirically, we conduct a study investigating correlations between two brain positron emission tomography (PET) modalities (AV-45 and FDG) within an Alzheimer's disease (AD) cohort. Our results demonstrate that MG-TCCA surpasses traditional TCCA in identifying sex-specific cross-modality imaging correlations, providing valuable insights for the characterization of multimodal imaging biomarkers in AD.
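As background for the two-view, single-group case that MG-TCCA generalizes, classical CCA reduces to computing the cosines of the principal angles between the column spaces of the two centered data blocks. The sketch below illustrates that baseline in NumPy; it is an illustrative reconstruction, not the authors' MG-TCCA implementation, and the function name is an assumption.

```python
import numpy as np

def first_canonical_correlation(X: np.ndarray, Y: np.ndarray) -> float:
    """First canonical correlation between two views X (n x p) and Y (n x q).

    After centering, the singular values of Ux.T @ Uy (with Ux, Uy taken from
    thin SVDs of X and Y) are the cosines of the principal angles between the
    two column spaces, i.e. the canonical correlations.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux, _, _ = np.linalg.svd(X, full_matrices=False)
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)
    correlations = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(correlations[0], 0.0, 1.0))
```

MG-TCCA extends this idea to tensor-valued views and multiple subgroups by sharing sparse canonical weights across groups.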

DOI: 10.1145/3584371.3612962 | Published: 2023-09-01 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10593155/pdf/
Citations: 0
Supervised Pretraining through Contrastive Categorical Positive Samplings to Improve COVID-19 Mortality Prediction.
Tingyi Wanyan, Mingquan Lin, Eyal Klang, Kartikeya M Menon, Faris F Gulamali, Ariful Azad, Yiye Zhang, Ying Ding, Zhangyang Wang, Fei Wang, Benjamin Glicksberg, Yifan Peng

Clinical EHR data are naturally heterogeneous, containing abundant sub-phenotypes. Such diversity creates challenges for outcome prediction with machine learning models, since it leads to high intra-class variance. To address this issue, we propose a supervised pre-training model with a unique embedded k-nearest-neighbor positive sampling strategy. We demonstrate the enhanced performance of this framework theoretically and show that it yields highly competitive experimental results in predicting patient mortality on real-world COVID-19 EHR data covering over 7,000 patients admitted to a large, urban health system. Our method achieves an AUROC of 0.872, outperforming the alternative pre-training models and traditional machine learning methods; it also performs much better when the training set is small (345 training instances).
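The core of the positive-sampling idea can be sketched in a few lines: for each anchor patient, positives are drawn from its k nearest same-label neighbors in embedding space, and then fed to a supervised contrastive loss. This is an illustrative reconstruction under assumed names and Euclidean distance, not the paper's code.

```python
import numpy as np

def knn_positive_indices(embeddings: np.ndarray, labels, k: int = 3):
    """For each anchor i, return up to k indices of its nearest same-label
    neighbors (Euclidean distance, excluding i itself); these serve as the
    positive samples in a supervised contrastive objective."""
    n = len(embeddings)
    # Pairwise distance matrix via broadcasting: (n, n)
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    positives = []
    for i in range(n):
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        same.sort(key=lambda j: dists[i, j])  # nearest same-label first
        positives.append(same[:k])
    return positives
```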

DOI: 10.1145/3535508.3545541 | Published: 2022-08-01 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9365529/pdf/nihms-1827823.pdf
Citations: 3
Segmenting Thoracic Cavities with Neoplastic Lesions: A Head-to-head Benchmark with Fully Convolutional Neural Networks.
Zhao Li, Rongbin Li, Kendall J Kiser, Luca Giancardo, W Jim Zheng

Automatic segmentation of thoracic cavity structures in computed tomography (CT) is a key step for applications ranging from radiotherapy planning to imaging biomarker discovery with radiomics approaches. State-of-the-art segmentation can be provided by fully convolutional neural networks such as the U-Net or V-Net. However, there is very limited work comparing the performance of these architectures on chest CTs with significant neoplastic disease. In this work, we compared four different types of fully convolutional architectures using the same pre-processing and post-processing pipelines. These methods were evaluated on a dataset of CT images and thoracic cavity segmentations from 402 cancer patients. We found that these methods achieved very high segmentation performance under three evaluation criteria: Dice coefficient, average symmetric surface distance (ASSD), and 95% Hausdorff distance (HD95). Overall, the two-stage 3D U-Net model performed slightly better than the other models, with Dice coefficients for the left and right lung reaching 0.947 and 0.952, respectively; however, the 3D U-Net model achieved the best performance under HD95 for the right lung and under ASSD for both lungs. These results demonstrate that current state-of-the-art deep learning models can segment not only healthy lungs but also lungs containing cancerous lesions at different stages. The comprehensive lung masks produced by the evaluated methods enabled the creation of imaging-based biomarkers representing both healthy lung parenchyma and neoplastic lesions, allowing these segmented areas to be used for downstream analysis, e.g. treatment planning, prognosis, and survival prediction.
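Of the three criteria, the Dice coefficient is the most widely used; a minimal NumPy version (an illustration, not the paper's evaluation code) is:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity between two binary masks: 2|A ∩ B| / (|A| + |B|).
    Returns 1.0 by convention when both masks are empty."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = int(np.logical_and(pred, truth).sum())
    denom = int(pred.sum()) + int(truth.sum())
    return 2.0 * intersection / denom if denom else 1.0
```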

DOI: 10.1145/3459930.3469564 | Published: 2021-08-01
Citations: 0
Assigning ICD-O-3 Codes to Pathology Reports using Neural Multi-Task Training with Hierarchical Regularization.
Anthony Rios, Eric B Durbin, Isaac Hands, Ramakanth Kavuluru

Tracking population-level cancer information is essential for researchers, clinicians, policymakers, and the public. Unfortunately, much of this information is stored as unstructured data in pathology reports; thus, to process it, we require either automated extraction techniques or manual curation. Moreover, many cancer-related concepts appear infrequently in real-world training datasets, which makes automated extraction difficult. This study introduces a novel technique that incorporates structured expert knowledge to improve histology and topography code classification models. Using pathology reports collected from the Kentucky Cancer Registry, we introduce a novel multi-task training approach with hierarchical regularization that incorporates structured information about the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3) classes to improve predictive performance. Overall, we find that our method improves both micro and macro F1. For macro F1, we achieve up to a 6% absolute improvement for topography codes and up to a 4% absolute improvement for histology codes.
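One simple way to realize a hierarchical regularizer of this kind is to penalize the distance between each code's weight vector and its parent's, so that rare child codes borrow statistical strength from their parents. The sketch below is a plausible formulation under that assumption; the paper's exact penalty may differ.

```python
import numpy as np

def hierarchical_penalty(W: np.ndarray, parent: list) -> float:
    """Squared-distance penalty tying each class's weights to its parent's.

    W[c] is the weight vector for class c; parent[c] is the index of c's
    parent in the code hierarchy, or -1 for a root (roots shrink toward zero).
    """
    total = 0.0
    for c, p in enumerate(parent):
        target = np.zeros_like(W[c]) if p < 0 else W[p]
        total += float(np.sum((W[c] - target) ** 2))
    return total
```

Adding this term to the classification loss pulls sibling codes toward a shared parent representation instead of shrinking every class independently toward zero.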

DOI: 10.1145/3459930.3469541 | Published: 2021-08-01
Citations: 5
Fast and memory-efficient scRNA-seq k-means clustering with various distances.
Daniel N Baker, Nathan Dyjack, Vladimir Braverman, Stephanie C Hicks, Ben Langmead
Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data obtained after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million cell dataset in minutes, using less than 10GiB of RAM. This memory-efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels. Availability: The open source library is at https://github.com/dnbaker/minicore. Code used for experiments is at https://github.com/dnbaker/minicore-experiments.
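As an illustration of the non-Euclidean measures mentioned above, Jensen-Shannon Divergence between two count vectors can be computed as follows. This is a standalone NumPy sketch, not minicore's optimized implementation.

```python
import numpy as np

def jensen_shannon_divergence(p, q) -> float:
    """JSD (natural-log base) between two nonnegative count/probability
    vectors; inputs are normalized to sum to 1. JSD is symmetric, bounded
    above by ln(2), and zero iff the distributions are equal."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # 0 * log(0/x) is taken as 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because JSD accepts raw counts directly, a k-means-style objective can use it without first embedding the counts in Euclidean space.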
DOI: 10.1145/3459930.3469523 | Published: 2021-08-01 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8586878/pdf/
Citations: 4
Joint Learning for Biomedical NER and Entity Normalization: Encoding Schemes, Counterfactual Examples, and Zero-Shot Evaluation.
Jiho Noh, Ramakanth Kavuluru

Named entity recognition (NER) and normalization (EN) form an indispensable first step to many biomedical natural language processing applications. In biomedical information science, recognizing entities (e.g., genes, diseases, or drugs) and normalizing them to concepts in standard terminologies or thesauri (e.g., Entrez, ICD-10, or RxNorm) is crucial for identifying more informative relations among them that drive disease etiology, progression, and treatment. In this effort we pursue two high-level strategies to improve biomedical NER and EN. The first is to decouple standard entity encoding tags (e.g., "B-Drug" for the beginning of a drug) into type tags (e.g., "Drug") and positional tags (e.g., "B"). The second is to use additional counterfactual training examples to handle the issue of models learning spurious correlations between surrounding context and normalized concepts in training data. We conduct elaborate experiments using the MedMentions dataset, the largest dataset of its kind for NER and EN in biomedicine. We find that our first strategy performs better in entity normalization than the standard coding scheme. The second, data-augmentation strategy uniformly improves performance in span detection, typing, and normalization; the gains from counterfactual examples are more prominent when evaluating in zero-shot settings, on concepts never encountered during training.
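The tag-decoupling strategy is easy to sketch in plain Python; this is an illustrative reconstruction, not the authors' code.

```python
def decouple_tags(tags):
    """Split BIO entity tags like 'B-Drug' into parallel positional and
    type tag sequences; the 'O' (outside) tag maps to 'O' in both."""
    positions, types = [], []
    for tag in tags:
        if tag == "O":
            positions.append("O")
            types.append("O")
        else:
            pos, etype = tag.split("-", 1)  # e.g. 'B-Drug' -> ('B', 'Drug')
            positions.append(pos)
            types.append(etype)
    return positions, types
```

The two resulting sequences can then be predicted by separate heads, so rare (position, type) combinations no longer need their own output class.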

DOI: 10.1145/3459930.3469533 | Published: 2021-08-01
Citations: 5
Concurrent Imputation and Prediction on EHR data using Bi-Directional GANs: Bi-GANs for EHR imputation and prediction.
Mehak Gupta, H Timothy Bunnell, Thao-Ly T Phan, Rahmatollah Beheshti

Working with electronic health records (EHRs) is known to be challenging for several reasons. The records may not have: 1) similar lengths (per visit), 2) the same number of observations (per patient), or 3) complete entries. These issues hinder the performance of predictive models built on EHRs. In this paper, we address them by presenting a model for the combined task of imputing and predicting values for irregularly observed, varying-length EHR data with missing entries. Our proposed model (dubbed Bi-GAN) uses a bidirectional recurrent network in a generative adversarial setting. In this architecture, the generator is a bidirectional recurrent network that receives the EHR data and imputes the missing values, while the discriminator attempts to distinguish the actual values from those imputed by the generator. Using the input data in its entirety, Bi-GAN learns how to impute missing elements in-between the input time steps (imputation) or outside them (prediction). Our method has three advantages over the state-of-the-art methods in the field: (a) a single model performs both the imputation and prediction tasks; (b) the model can make predictions from time-series of varying length with missing data; (c) it does not require knowing the observation and prediction time windows during training and can be used with different observation and prediction window lengths, for both short- and long-term predictions. We evaluate our model on two large EHR datasets to impute and predict body mass index (BMI) values and show its superior performance in both settings.
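The combine step common to GAN-based imputation (keep observed entries, fill missing ones from the generator) can be sketched as follows. The masking convention is an assumption for illustration; Bi-GAN's actual recurrent generator and discriminator are of course far more involved.

```python
import numpy as np

def combine_imputed(x: np.ndarray, mask: np.ndarray, generated: np.ndarray) -> np.ndarray:
    """Keep observed entries (mask == 1) and fill missing ones (mask == 0)
    with the generator's output."""
    return mask * x + (1 - mask) * generated
```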

DOI: 10.1145/3459930.3469512 | Published: 2021-08-01 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8482531/pdf/nihms-1740754.pdf
引用次数: 0
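The masking idea at the core of GAN-based imputation -- keep the observed entries, let the generator fill the gaps, and supervise the generator only where ground truth exists -- can be sketched in a few lines of numpy. This is an illustrative sketch of the general technique, not the authors' Bi-GAN code; the function names are hypothetical.

```python
import numpy as np

def impute_with_mask(x, x_gen, mask):
    """Keep observed entries of x (mask == 1); fill the missing
    entries with the generator's estimates x_gen."""
    x_obs = np.where(mask == 1, x, 0.0)   # zero out missing (possibly NaN) slots
    return x_obs + (1 - mask) * x_gen

def masked_reconstruction_loss(x, x_gen, mask):
    """Mean squared error over observed entries only: the generator
    can be supervised only where ground truth exists."""
    diff = np.where(mask == 1, x - x_gen, 0.0)
    return float((diff ** 2).sum() / max(mask.sum(), 1))
```

In the full adversarial setting this reconstruction term is combined with the discriminator's loss; the masking trick is what lets a single network serve both in-window imputation and out-of-window prediction.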
Transformer-Based Named Entity Recognition for Parsing Clinical Trial Eligibility Criteria.
Shubo Tian, Arslan Erdengasileng, Xi Yang, Yi Guo, Yonghui Wu, Jinfeng Zhang, Jiang Bian, Zhe He

The rapid adoption of electronic health record (EHR) systems has made clinical data available in electronic format for research and for many downstream applications. Electronically screening potentially eligible patients for clinical trials using these clinical databases is a critical need for improving trial recruitment efficiency. Nevertheless, manually translating free-text eligibility criteria into database queries is labor intensive and inefficient. To facilitate automated screening, free-text eligibility criteria must be structured and coded into a computable format using controlled vocabularies, making named entity recognition (NER) an important first step. In this study, we evaluate four state-of-the-art transformer-based NER models on two publicly available annotated corpora of eligibility criteria released by Columbia University (the Chia data) and Facebook Research (the FRD data). Four transformer-based models (BERT, ALBERT, RoBERTa, and ELECTRA) pretrained on general English corpora were compared with the same architectures pretrained on PubMed citations, clinical notes from the MIMIC-III dataset, and eligibility criteria extracted from all the clinical trials on ClinicalTrials.gov. Experimental results show that RoBERTa pretrained with MIMIC-III clinical notes and eligibility criteria yielded the highest strict and relaxed F-scores on both the Chia data (0.658/0.798) and the FRD data (0.785/0.916). With these promising NER results, further investigation into building a reliable natural language processing (NLP)-assisted pipeline for automated electronic screening is warranted.

{"title":"Transformer-Based Named Entity Recognition for Parsing Clinical Trial Eligibility Criteria.","authors":"Shubo Tian,&nbsp;Arslan Erdengasileng,&nbsp;Xi Yang,&nbsp;Yi Guo,&nbsp;Yonghui Wu,&nbsp;Jinfeng Zhang,&nbsp;Jiang Bian,&nbsp;Zhe He","doi":"10.1145/3459930.3469560","DOIUrl":"https://doi.org/10.1145/3459930.3469560","url":null,"abstract":"<p><p>The rapid adoption of electronic health records (EHRs) systems has made clinical data available in electronic format for research and for many downstream applications. Electronic screening of potentially eligible patients using these clinical databases for clinical trials is a critical need to improve trial recruitment efficiency. Nevertheless, manually translating free-text eligibility criteria into database queries is labor intensive and inefficient. To facilitate automated screening, free-text eligibility criteria must be structured and coded into a computable format using controlled vocabularies. Named entity recognition (NER) is thus an important first step. In this study, we evaluate 4 state-of-the-art transformer-based NER models on two publicly available annotated corpora of eligibility criteria released by Columbia University (i.e., the Chia data) and Facebook Research (i.e.the FRD data). Four transformer-based models (i.e., BERT, ALBERT, RoBERTa, and ELECTRA) pretrained with general English domain corpora vs. those pretrained with PubMed citations, clinical notes from the MIMIC-III dataset and eligibility criteria extracted from all the clinical trials on ClinicalTrials.gov were compared. Experimental results show that RoBERTa pretrained with MIMIC-III clinical notes and eligibility criteria yielded the highest strict and relaxed F-scores in both the Chia data (i.e., 0.658/0.798) and the FRD data (i.e., 0.785/0.916). 
With promising NER results, further investigations on building a reliable natural language processing (NLP)-assisted pipeline for automated electronic screening are needed.</p>","PeriodicalId":72044,"journal":{"name":"ACM-BCB ... ... : the ... ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM Conference on Bioinformatics, Computational Biology and Biomedicine","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3459930.3469560","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39328500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
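Turning token-level predictions into entity spans is a standard post-processing step for transformer NER models of this kind, and the strict/relaxed F-scores quoted above correspond to exact-span versus overlapping-span matching. A minimal sketch, assuming the common BIO tagging scheme (function names hypothetical):

```python
def bio_to_spans(tags):
    """Convert token-level BIO tags to (start, end, label) entity
    spans with an exclusive end index."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and label == tag[2:]:
            continue                      # entity continues
        else:                             # "O" or a stray I- tag closes any open span
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

def relaxed_match(a, b):
    """Relaxed criterion: same label and overlapping offsets;
    strict matching would require a == b exactly."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]
```

With spans extracted this way from predicted and gold tag sequences, strict and relaxed precision/recall/F-score follow directly from exact and overlapping span matches.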
KGDAL: Knowledge Graph Guided Double Attention LSTM for Rolling Mortality Prediction for AKI-D Patients.
Lucas Jing Liu, Victor Ortiz-Soriano, Javier A Neyra, Jin Chen

With the rapid accumulation of electronic health record (EHR) data, deep learning (DL) models have exhibited promising performance on patient risk prediction. Recent advances have also demonstrated the effectiveness of knowledge graphs (KG) in providing valuable prior knowledge for further improving DL model performance. However, it is still unclear how KG can be utilized to encode high-order relations among clinical concepts and how DL models can make full use of the encoded concept relations to solve real-world healthcare problems and to interpret the outcomes. We propose a novel knowledge graph guided double attention LSTM model named KGDAL for rolling mortality prediction for critically ill patients with acute kidney injury requiring dialysis (AKI-D). KGDAL constructs a KG-based two-dimension attention in both time and feature spaces. In the experiment with two large healthcare datasets, we compared KGDAL with a variety of rolling mortality prediction models and conducted an ablation study to test the effectiveness, efficacy, and contribution of different attention mechanisms. The results showed that KGDAL clearly outperformed all the compared models. Also, KGDAL-derived patient risk trajectories may assist healthcare providers to make timely decisions and actions. The source code, sample data, and manual of KGDAL are available at https://github.com/lucasliu0928/KGDAL.

{"title":"KGDAL: Knowledge Graph Guided Double Attention LSTM for Rolling Mortality Prediction for AKI-D Patients.","authors":"Lucas Jing Liu,&nbsp;Victor Ortiz-Soriano,&nbsp;Javier A Neyra,&nbsp;Jin Chen","doi":"10.1145/3459930.3469513","DOIUrl":"https://doi.org/10.1145/3459930.3469513","url":null,"abstract":"<p><p>With the rapid accumulation of electronic health record (EHR) data, deep learning (DL) models have exhibited promising performance on patient risk prediction. Recent advances have also demonstrated the effectiveness of knowledge graphs (KG) in providing valuable prior knowledge for further improving DL model performance. However, it is still unclear how KG can be utilized to encode high-order relations among clinical concepts and how DL models can make full use of the encoded concept relations to solve real-world healthcare problems and to interpret the outcomes. We propose a novel knowledge graph guided double attention LSTM model named KGDAL for rolling mortality prediction for critically ill patients with acute kidney injury requiring dialysis (AKI-D). KGDAL constructs a KG-based two-dimension attention in both time and feature spaces. In the experiment with two large healthcare datasets, we compared KGDAL with a variety of rolling mortality prediction models and conducted an ablation study to test the effectiveness, efficacy, and contribution of different attention mechanisms. The results showed that KGDAL clearly outperformed all the compared models. Also, KGDAL-derived patient risk trajectories may assist healthcare providers to make timely decisions and actions. The source code, sample data, and manual of KGDAL are available at https://github.com/lucasliu0928/KGDAL.</p>","PeriodicalId":72044,"journal":{"name":"ACM-BCB ... ... : the ... ACM Conference on Bioinformatics, Computational Biology and Biomedicine. 
ACM Conference on Bioinformatics, Computational Biology and Biomedicine","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3459930.3469513","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"39453029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
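The "two-dimension attention in both time and feature spaces" can be illustrated with a toy numpy version that computes one softmax over the time steps and one over the features of a (T, F) hidden-state matrix. This is a simplified sketch under assumed shapes, not the published KG-guided LSTM model; the parameter vectors stand in for learned weights.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def two_dim_attention(h, u_time, u_feat):
    """h is a (T, F) matrix of hidden states; u_time (T,) and u_feat (F,)
    play the role of learned attention parameters. alpha weights time
    steps, beta weights features, and both reweight h before pooling."""
    alpha = softmax(h @ u_feat)                                  # (T,) over time
    beta = softmax(u_time @ h)                                   # (F,) over features
    context = (alpha[:, None] * h * beta[None, :]).sum(axis=0)   # (F,) pooled context
    return alpha, beta, context
```

Inspecting alpha and beta is also what makes such models interpretable: they indicate which time steps and which clinical features drove a given risk prediction.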
Unsupervised manifold alignment for single-cell multi-omics data.
Ritambhara Singh, Pinar Demetci, Giancarlo Bonora, Vijay Ramani, Choli Lee, He Fang, Zhijun Duan, Xinxian Deng, Jay Shendure, Christine Disteche, William Stafford Noble

Integrating single-cell measurements that capture different properties of the genome is vital to extending our understanding of genome biology. This task is challenging due to the lack of a shared axis across datasets obtained from different types of single-cell experiments. For most such datasets, we lack corresponding information among the cells (samples) and the measurements (features). In this scenario, unsupervised algorithms that are capable of aligning single-cell experiments are critical to learning an in silico co-assay that can help draw correspondences among the cells. Maximum mean discrepancy-based manifold alignment (MMD-MA) is such an unsupervised algorithm. Without requiring correspondence information, it can align single-cell datasets from different modalities in a common shared latent space, showing promising results on simulations and a small-scale single-cell experiment with 61 cells. However, it is essential to explore the applicability of this method to larger single-cell experiments with thousands of cells so that it can be of practical interest to the community. In this paper, we apply MMD-MA to two recent datasets that measure transcriptome and chromatin accessibility in ~2000 single cells. To scale the runtime of MMD-MA to a more substantial number of cells, we extend the original implementation to run on GPUs. We also introduce a method to automatically select one of the user-defined parameters, thus reducing the hyperparameter search space. We demonstrate that the proposed extensions allow MMD-MA to accurately align state-of-the-art single-cell experiments.

{"title":"Unsupervised manifold alignment for single-cell multi-omics data.","authors":"Ritambhara Singh,&nbsp;Pinar Demetci,&nbsp;Giancarlo Bonora,&nbsp;Vijay Ramani,&nbsp;Choli Lee,&nbsp;He Fang,&nbsp;Zhijun Duan,&nbsp;Xinxian Deng,&nbsp;Jay Shendure,&nbsp;Christine Disteche,&nbsp;William Stafford Noble","doi":"10.1145/3388440.3412410","DOIUrl":"https://doi.org/10.1145/3388440.3412410","url":null,"abstract":"<p><p>Integrating single-cell measurements that capture different properties of the genome is vital to extending our understanding of genome biology. This task is challenging due to the lack of a shared axis across datasets obtained from different types of single-cell experiments. For most such datasets, we lack corresponding information among the cells (samples) and the measurements (features). In this scenario, unsupervised algorithms that are capable of aligning single-cell experiments are critical to learning an <i>in silico</i> co-assay that can help draw correspondences among the cells. Maximum mean discrepancy-based manifold alignment (MMD-MA) is such an unsupervised algorithm. Without requiring correspondence information, it can align single-cell datasets from different modalities in a common shared latent space, showing promising results on simulations and a small-scale single-cell experiment with 61 cells. However, it is essential to explore the applicability of this method to larger single-cell experiments with thousands of cells so that it can be of practical interest to the community. In this paper, we apply MMD-MA to two recent datasets that measure transcriptome and chromatin accessibility in ~2000 single cells. To scale the runtime of MMD-MA to a more substantial number of cells, we extend the original implementation to run on GPUs. We also introduce a method to automatically select one of the user-defined parameters, thus reducing the hyperparameter search space. 
We demonstrate that the proposed extensions allow MMD-MA to accurately align state-of-the-art single-cell experiments.</p>","PeriodicalId":72044,"journal":{"name":"ACM-BCB ... ... : the ... ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM Conference on Bioinformatics, Computational Biology and Biomedicine","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3388440.3412410","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10130200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 38
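The maximum mean discrepancy at the heart of MMD-MA compares two samples through kernel mean embeddings, and the standard biased RBF-kernel estimator is a few lines of numpy. This sketches the statistic only; the full method minimizes it between learned latent embeddings of the two modalities, which this toy omits.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2) for every pair of rows."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD: zero when x and y are the same
    sample, growing as their distributions diverge."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())
```

Because the statistic needs no cell-to-cell correspondences, it suits exactly the setting described above, where samples and features are unmatched across single-cell modalities.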