Briefings in bioinformatics: latest articles, page 8

CMTT-JTracker: a fully test-time adaptive framework serving automated cell lineage construction.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae591

Liuyin Chen, Sanyuan Fu, Zijun Zhang

Cell tracking is an essential function needed in automated cellular activity monitoring. In practice, processing methods striking a balance between computational efficiency and accuracy as well as demonstrating robust generalizability across diverse cell datasets are highly desired. This paper develops a central-metric fully test-time adaptive framework for cell tracking (CMTT-JTracker). Firstly, a CMTT mechanism is designed for the pre-segmentation of cell images, which enables extracting target information at different resolutions without additional training. Next, a multi-task learning network with the spatial attention scheme is developed to simultaneously realize detection and re-identification tasks based on features extracted by CMTT. Experimental results demonstrate that the CMTT-JTracker exhibits remarkable biological and tracking performance compared with benchmarking tracking methods. It achieves a multiple object tracking accuracy (MOTA) of $0.894$ on Fluo-N2DH-SIM+ and a MOTA of $0.850$ on PhC-C2DL-PSC. Experimental results further confirm that the CMTT applied solely as a segmentation unit outperforms the SOTA segmentation benchmarks on various datasets, particularly excelling in scenarios with dense cells. The Dice coefficients of the CMTT range from a high of $0.928$ to a low of $0.758$ across different datasets.
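The two metrics quoted in the abstract above can be illustrated with a small sketch. It assumes the standard definitions (Dice as twice the overlap divided by the total mask size, and the CLEAR MOT form of MOTA); the abstract does not state which exact variants the authors computed, and the masks and error counts below are made-up toy values.

```python
def dice_coefficient(pred, truth):
    """Dice = 2|A & B| / (|A| + |B|) for two sets of foreground pixel coordinates."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # both masks empty: treated as perfect agreement by convention
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def mota(false_negatives, false_positives, id_switches, num_gt_objects):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt_objects <= 0:
        raise ValueError("num_gt_objects must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt_objects


# Toy masks as sets of (row, col) pixels; values are invented for illustration.
pred_mask = {(0, 0), (0, 1), (1, 0), (1, 1)}
true_mask = {(0, 1), (1, 1), (1, 2), (2, 1)}
print(dice_coefficient(pred_mask, true_mask))  # 0.5 (2 shared pixels, 4 + 4 total)
print(mota(6, 3, 1, 100))
```

Under these definitions, a MOTA of 0.894 would mean that misses, false detections and identity switches together amount to roughly 10.6% of the ground-truth object count.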

Volume 25, issue 6. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11570544/pdf/

Citations: 0

Digital annealing optimization for natural product structure elucidation.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae600

Chien Lee, Pei-Hua Wang, Yufeng Jane Tseng

The digital annealer (DA) leverages its computational capabilities of up to 100 000 bits to address the complex nondeterministic polynomial-time (NP)-complete challenge inherent in elucidating complex structures of natural products. Conventional computational methods often face limitations with complex mixtures, as they struggle to manage the high dimensionality and intertwined relationships typical in natural products, resulting in inefficiencies and inaccuracies. This study reformulates the challenge into a Quadratic Unconstrained Binary Optimization framework, thereby harnessing the quantum-inspired computing power of the DA. Utilizing mass spectrometry data from three distinct herb species and various potential scaffolds, the DA proficiently locates optimal sidechain combinations that adhere to predefined target molecular weights. This methodology enhances the probability of selecting appropriate sidechains and substituted positions and ensures the generation of solutions within a reasonable 5-min window. The findings underscore the transformative potential of the DA in the realms of analytical chemistry and drug discovery, markedly improving both the precision and practicality of natural product structure elucidation.
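To make the Quadratic Unconstrained Binary Optimization (QUBO) idea concrete, here is a minimal sketch of how a mass-matching problem can be written in that form. It is not the authors' formulation: the sidechain masses, the residual mass, and the single squared-error objective are all assumptions chosen for illustration, and an exhaustive search stands in for the digital annealer.

```python
from itertools import product


def build_qubo(masses, residual):
    """Return (Q, offset) such that x^T Q x + offset == (sum(m_i x_i) - residual)^2
    for binary x. Q is upper-triangular, stored as a dict {(i, j): weight}."""
    q = {}
    n = len(masses)
    for i in range(n):
        # x_i^2 == x_i for binary variables, so linear terms sit on the diagonal.
        q[(i, i)] = masses[i] ** 2 - 2 * residual * masses[i]
        for j in range(i + 1, n):
            q[(i, j)] = 2 * masses[i] * masses[j]
    return q, residual ** 2


def qubo_energy(q, offset, x):
    """Evaluate the QUBO objective for one binary assignment x."""
    return offset + sum(w * x[i] * x[j] for (i, j), w in q.items())


def brute_force_minimum(q, offset, n):
    """Exhaustive search; only feasible for tiny n, standing in for the annealer."""
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(q, offset, x))


# Hypothetical integer masses for four candidate sidechains, and a hypothetical
# residual mass (target molecular weight minus scaffold weight).
sidechain_masses = [15, 17, 31, 45]
residual_mass = 62
Q, offset = build_qubo(sidechain_masses, residual_mass)
best = brute_force_minimum(Q, offset, len(sidechain_masses))
print(best, qubo_energy(Q, offset, best))  # (0, 1, 0, 1) 0
```

An energy of zero means the selected sidechains (here 17 + 45) reproduce the residual mass exactly. A realistic formulation would presumably add penalty terms, for example to allow only one sidechain per substitution position, which this sketch omits.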

Volume 25, issue 6. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11570542/pdf/

Citations: 0

An initial game-theoretic assessment of enhanced tissue preparation and imaging protocols for improved deep learning inference of spatial transcriptomics from tissue morphology.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae476

Michael Y Fatemi, Yunrui Lu, Alos B Diallo, Gokul Srinivasan, Zarif L Azher, Brock C Christensen, Lucas A Salas, Gregory J Tsongalis, Scott M Palisoul, Laurent Perreard, Fred W Kolling, Louis J Vaickus, Joshua J Levy

The application of deep learning to spatial transcriptomics (ST) can reveal relationships between gene expression and tissue architecture. Prior work has demonstrated that inferring gene expression from tissue histomorphology can discern these spatial molecular markers to enable population-scale studies, reducing the fiscal barriers associated with large-scale spatial profiling. However, while most improvements in algorithmic performance have focused on improving model architectures, little is known about how the quality of tissue preparation and imaging can affect deep learning model training for spatial inference from morphology and its potential for widespread clinical adoption. Prior studies of ST inference from histology typically utilize manually stained frozen sections with imaging on non-clinical grade scanners. Training such models on ST cohorts is also costly. We hypothesize that adopting tissue processing and imaging practices that mirror standards for clinical implementation (permanent sections, automated tissue staining, and clinical grade scanning) can significantly improve model performance. An enhanced specimen processing and imaging protocol was developed for deep learning-based ST inference from morphology. This protocol featured the Visium CytAssist assay to permit automated hematoxylin and eosin staining (e.g. Leica Bond), 40×-resolution imaging, and joining of multiple patients' tissue sections per capture area prior to ST profiling. Using a cohort of 13 patients with pathologic T stage III colorectal cancer, we compared the performance of models trained on slides prepared using enhanced versus traditional (i.e. manual staining and low-resolution imaging) protocols. Leveraging Inceptionv3 neural networks, we predicted gene expression across serial, histologically matched tissue sections using whole slide images (WSI) from both protocols. Data Shapley was used to quantify and compare, on a patient-by-patient basis, the marginal performance gains attributed to using the enhanced protocol versus the actual costs of spatial profiling. Findings indicate that training and validating on WSI acquired through the enhanced protocol, as opposed to the traditional method, resulted in improved performance at lower fiscal cost. In the realm of ST, the enhancement of deep learning architectures frequently captures the spotlight; however, the significance of specimen processing and imaging is often understated. This research, informed through a game-theoretic lens, underscores the substantial impact that specimen preparation/imaging can have on spatial transcriptomic inference from morphology. It is essential to integrate such optimized processing protocols to facilitate the identification of prognostic markers at a larger scale.
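The game-theoretic attribution mentioned in the abstract can be sketched as an exact Shapley computation over a tiny cohort. This is only an illustration of the general technique: the patient identifiers, the per-patient scores, and the additive toy utility are hypothetical, whereas a real Data Shapley analysis would retrain or approximate a model for each training subset and would typically use Monte Carlo sampling instead of enumerating every ordering.

```python
from itertools import permutations


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution of each player
    over all orderings. Cost grows factorially, so only for small cohorts."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        previous = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            current = utility(coalition)
            totals[p] += current - previous
            previous = current
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical utility: validation performance as a function of which patients'
# slides are in the training set. A real study would train a model per subset.
TOY_SCORES = {"P1": 0.30, "P2": 0.20, "P3": 0.10}


def toy_utility(coalition):
    base = sum(TOY_SCORES[p] for p in coalition)
    # Small synergy bonus when P1 and P2 are both present.
    return base + (0.10 if {"P1", "P2"} <= coalition else 0.0)


values = shapley_values(list(TOY_SCORES), toy_utility)
print({p: round(v, 3) for p, v in values.items()})
```

In this toy setting the synergy bonus is split evenly between P1 and P2, and the values sum to the utility of the full cohort, which is the efficiency property that makes Shapley values convenient for weighing per-patient gains against per-patient costs.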

Volume 25, issue 6. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11452536/pdf/

Citations: 0

Current computational tools for protein lysine acylation site prediction.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae469

Zhaohui Qin, Haoran Ren, Pei Zhao, Kaiyuan Wang, Huixia Liu, Chunbo Miao, Yanxiu Du, Junzhou Li, Liuji Wu, Zhen Chen

As a main subtype of post-translational modification (PTM), protein lysine acylations (PLAs) play crucial roles in regulating diverse functions of proteins. With recent advancements in proteomics technology, the identification of PTMs is becoming a data-rich field. A large amount of experimentally verified data urgently needs to be translated into valuable biological insights. With computational approaches, PLA can be accurately detected across the whole proteome, even for organisms with small-scale datasets. Herein, a comprehensive summary of 166 in silico PLA prediction methods is presented, covering predictors for a single type of PLA site as well as for multiple types of PLA sites. This summary covers aspects that are critical for the development of a robust predictor, including data collection and preparation, sample selection, feature representation, classification algorithm design, model evaluation, and method availability. Notably, we discuss the application of protein language models and transfer learning to solve the small-sample learning issue. We also highlight the prediction methods developed for functionally relevant PLA sites and species/substrate/cell-type-specific PLA sites. In conclusion, this systematic review could potentially facilitate the development of novel PLA predictors and offer useful insights to researchers from various disciplines.

Volume 25, issue 6. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11421846/pdf/

Citations: 0

Multi-Cover Persistence (MCP)-based machine learning for polymer property prediction.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae465

Yipeng Zhang, Cong Shen, Kelin Xia

Accurate and efficient prediction of polymer properties is crucial for polymer design. Recently, data-driven artificial intelligence (AI) models have demonstrated great promise in polymer property analysis. Despite this progress, a pivotal challenge for all AI-driven models remains the effective representation of molecules. Here we introduce Multi-Cover Persistence (MCP)-based molecular representation and featurization for the first time. Our MCP-based polymer descriptors are combined with machine learning models, in particular Gradient Boosting Tree (GBT) models, for polymer property prediction. Unlike previous molecular representations, polymer molecular structure and interactions are represented as MCP, which utilizes Delaunay slices at different dimensions and Rhomboid tiling to characterize the complicated geometric and topological information within the data. Statistical features from the generated persistent barcodes are used as polymer descriptors and are further combined with the GBT model. Our model has been extensively validated on polymer benchmark datasets. We find that our models can outperform traditional fingerprint-based models and have accuracy similar to that of geometric deep learning models. In particular, our model tends to be more effective on large-sized monomer structures, demonstrating the great potential of MCP in characterizing more complicated polymer data. This work underscores the potential of MCP in polymer informatics, presenting a novel perspective on molecular representation and its application in polymer science.
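The step from persistent barcodes to fixed-length descriptors can be sketched as follows. The abstract does not list which statistics the authors extract, so the particular set below (count, minimum, maximum, mean, standard deviation and sum of bar lengths) is an assumption, and the toy barcode is invented; computing the barcode itself from a molecular structure is outside the scope of this sketch.

```python
from statistics import mean, pstdev


def barcode_features(bars):
    """Summary statistics of a persistence barcode given as (birth, death) pairs."""
    lengths = [death - birth for birth, death in bars]
    if not lengths:
        # An empty barcode still needs a fixed-length feature vector.
        return {"count": 0, "min": 0.0, "max": 0.0, "mean": 0.0, "std": 0.0, "sum": 0.0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "std": pstdev(lengths),  # population standard deviation of bar lengths
        "sum": sum(lengths),
    }


# Toy barcode with three bars; the values are made up for illustration.
toy_bars = [(0.0, 1.0), (0.0, 3.0), (1.0, 3.0)]
features = barcode_features(toy_bars)
print(features)
```

Because every barcode maps to the same keys regardless of how many bars it contains, vectors like this can be concatenated across homology dimensions and passed to a tree-based regressor such as a gradient boosting model.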

Volume 25, issue 6. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11424509/pdf/

Citations: 0

NASTRA: accurate analysis of short tandem repeat markers by nanopore sequencing with repeat-structure-aware algorithm.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae472

Zilin Ren, Jiarong Zhang, Yixiang Zhang, Tingting Yang, Pingping Sun, Jiguo Xue, Xiaochen Bo, Bo Zhou, Jiangwei Yan, Ming Ni

Short tandem repeats (STRs) are a type of genetic marker extensively utilized in biomedical and forensic applications. Due to sequencing noise in nanopore sequencing, accurate analysis methods are lacking. We developed NASTRA, an innovative tool for Nanopore Autosomal Short Tandem Repeat Analysis, which overcomes the limitations of traditional database-based methods and provides a precise germline analysis of STR genetic markers without the need for allele sequence references. Demonstrating high accuracy in cell line authentication testing and paternity testing, NASTRA significantly surpasses existing methods in both speed and accuracy. This advancement makes it a promising solution for rapid cell line authentication and kinship testing, highlighting the potential of nanopore sequencing for in-field applications.

Citations: 0

nsDCC: dual-level contrastive clustering with nonuniform sampling for scRNA-seq data analysis.

IF 6.8, Zone 2 (Biology), Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae477

Linjie Wang, Wei Li, Fanghui Zhou, Kun Yu, Chaolu Feng, Dazhe Zhao

Dimensionality reduction and clustering are crucial tasks in single-cell RNA sequencing (scRNA-seq) data analysis, but they are treated independently in the current process, which hinders their mutual benefits. The latest methods jointly optimize these tasks through deep clustering. However, common deep clustering methods require pre-defined cluster centers, a gap that contrastive learning, with its powerful representation capability, can bridge. Therefore, a dual-level contrastive clustering method with nonuniform sampling (nsDCC) is proposed for scRNA-seq data analysis. Dual-level contrastive clustering, which combines instance-level contrast and cluster-level contrast, jointly optimizes dimensionality reduction and clustering. Multi-positive contrastive learning and a unit matrix constraint are introduced in instance- and cluster-level contrast, respectively. Furthermore, an attention mechanism is introduced to capture inter-cellular information, which is beneficial for clustering. The nsDCC focuses on important samples at category boundaries and in minority categories through the proposed nearest boundary sparsest density weight assignment algorithm, making it capable of capturing comprehensive characteristics in imbalanced datasets. Experimental results show that nsDCC outperforms six other state-of-the-art methods on both real and simulated scRNA-seq data, validating its performance on dimensionality reduction and clustering of scRNA-seq data, especially for imbalanced data. Simulation experiments demonstrate that nsDCC is insensitive to "dropout events" in scRNA-seq. Finally, cluster differential expressed gene analysis confirms the meaningfulness of results from nsDCC. In summary, nsDCC is a new way of analyzing and understanding scRNA-seq data.
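As a rough illustration of instance-level contrast, the sketch below implements the common single-positive InfoNCE loss over two augmented views of the same cells. It is a simplification and not the nsDCC objective: the abstract describes multi-positive contrast, a cluster-level term and nonuniform sample weights, none of which are modelled here, and the embeddings and temperature are made-up values.

```python
from math import exp, log, sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss where view_a[i] and view_b[i] embed the same cell and
    every other row of view_b serves as a negative for view_a[i]."""
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        log_denominator = log(sum(exp(value) for value in logits))
        # -log(softmax probability assigned to the matching view)
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy two-dimensional embeddings for two cells under two augmentations.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b_aligned = [[1.0, 0.1], [0.1, 1.0]]   # row i matches view_a[i]
view_b_swapped = [[0.1, 1.0], [1.0, 0.1]]   # rows deliberately mismatched
aligned_loss = info_nce(view_a, view_b_aligned)
swapped_loss = info_nce(view_a, view_b_swapped)
print(aligned_loss < swapped_loss)  # True
```

The loss is low when each cell's two views are closer to each other than to the views of other cells, which is the property that lets a contrastive objective shape the embedding without pre-defined cluster centers.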

Volume 25, issue 6. Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427072/pdf/

Citations: 0

Protein language models are performant in structure-free virtual screening.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae480

Hilbert Yuen In Lam, Jia Sheng Guan, Xing Er Ong, Robbe Pincket, Yuguang Mu

Hitherto, virtual screening (VS) has typically been performed using a structure-based drug design paradigm. Such methods typically require molecular docking on high-resolution three-dimensional structures of a target protein, a computationally intensive and time-consuming exercise. This work demonstrates that by employing protein language models and molecular graphs as inputs to a novel graph-to-transformer cross-attention mechanism, a screening power comparable to state-of-the-art structure-based models can be achieved. The implications include highly expedited VS, owing to the greatly reduced compute required to run this model, and the ability to perform early stages of computer-aided drug design in the complete absence of 3D protein structures.
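
To make the cross-attention idea concrete, here is a minimal sketch; it is not the authors' implementation. It assumes that ligand atom (graph node) embeddings act as queries and that per-residue protein language model embeddings act as keys and values. Learned projection matrices, multiple heads, and the graph encoder itself are omitted, and the function name `cross_attention` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: one vector per ligand atom (assumed graph node embeddings)
    keys, values: one vector per protein residue (assumed PLM embeddings)
    Returns (attended vectors, attention weights), one row per query.
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this atom to every residue, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        w = softmax(scores)
        # Weighted average of the residue value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy example: 3 atoms attending over 2 residues (made-up numbers).
atom_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
residues = [[2.0, 0.0], [0.0, 2.0]]
attended, attn = cross_attention(atom_nodes, residues, residues)
```

Each atom ends up with a protein-aware representation whose size does not depend on the protein length, which is one plausible reason such a model can skip 3D structures and docking entirely.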

Citations: 0

scDFN: enhancing single-cell RNA-seq clustering with deep fusion networks.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae486

Tianxiang Liu, Cangzhi Jia, Yue Bi, Xudong Guo, Quan Zou, Fuyi Li

Single-cell ribonucleic acid sequencing (scRNA-seq) technology can be used to perform high-resolution analysis of the transcriptomes of individual cells. Therefore, its application has gained popularity for accurately analyzing the ever-increasing content of heterogeneous single-cell datasets. Central to interpreting scRNA-seq data is the clustering of cells to decipher transcriptomic diversity and infer cell behavior patterns. However, its complexity necessitates the application of advanced methodologies capable of resolving the inherent heterogeneity and limited gene expression characteristics of single-cell data. Herein, we introduce a novel deep learning-based algorithm for single-cell clustering, designated scDFN, which can significantly enhance the clustering of scRNA-seq data through a fusion network strategy. The scDFN algorithm applies a dual mechanism involving an autoencoder to extract attribute information and an improved graph autoencoder to capture topological nuances, integrated via a cross-network information fusion mechanism complemented by a triple self-supervision strategy. This fusion is optimized through a holistic consideration of four distinct loss functions. A comparative analysis with five leading scRNA-seq clustering methodologies across multiple datasets revealed the superiority of scDFN, as determined by higher Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) scores. Additionally, scDFN demonstrated robust multi-cluster dataset performance and exceptional resilience to batch effects. Ablation studies highlighted the key roles of the autoencoder and the improved graph autoencoder components, along with the critical contribution of the four joint loss functions to the overall efficacy of the algorithm. Through these advancements, scDFN set a new benchmark in single-cell clustering and can be used as an effective tool for the nuanced analysis of single-cell transcriptomics.
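
The two evaluation metrics named above can be computed from cluster labels alone. The sketch below uses the standard textbook definitions and is independent of scDFN itself. Normalizing NMI by the arithmetic mean of the two entropies is an assumption, since the abstract does not say which normalization was used. The function names and the toy labels are hypothetical, and the ARI function expects at least two cells.

```python
import math
from collections import Counter


def _pairs(n):
    """Number of unordered pairs that can be drawn from n items."""
    return n * (n - 1) // 2


def adjusted_rand_index(true_labels, pred_labels):
    """ARI: pair-counting agreement between two partitions, corrected for chance."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    rows, cols = Counter(true_labels), Counter(pred_labels)
    index = sum(_pairs(c) for c in joint.values())
    sum_rows = sum(_pairs(c) for c in rows.values())
    sum_cols = sum(_pairs(c) for c in cols.values())
    expected = sum_rows * sum_cols / _pairs(n)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (index - expected) / (max_index - expected)


def _entropy(counts, n):
    return -sum((c / n) * math.log(c / n) for c in counts)


def normalized_mutual_information(true_labels, pred_labels):
    """NMI: mutual information divided by the mean of the two label entropies."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    rows, cols = Counter(true_labels), Counter(pred_labels)
    mi = sum((c / n) * math.log(c * n / (rows[t] * cols[p]))
             for (t, p), c in joint.items())
    denom = (_entropy(rows.values(), n) + _entropy(cols.values(), n)) / 2
    return mi / denom if denom > 0 else 1.0


# Toy labels: the second clustering is the first with cluster names permuted.
cell_types = [0, 0, 1, 1, 2, 2]
clusters = [1, 1, 0, 0, 2, 2]
ari_perfect = adjusted_rand_index(cell_types, clusters)
nmi_perfect = normalized_mutual_information(cell_types, clusters)

# A clustering that is unrelated to the true labels.
ari_bad = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
nmi_bad = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Both scores ignore how clusters are named, so a method is not penalized for labeling a cell type "1" instead of "0". ARI can fall below zero when agreement is worse than chance, whereas NMI is bounded below by zero.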

Citations: 0

Structure-preserved integration of scRNA-seq data using heterogeneous graph neural network.

IF 6.8 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS

Briefings in bioinformatics

Pub Date : 2024-09-23 DOI: 10.1093/bib/bbae538

Xun Zhang, Kun Qian, Hongwei Li

The integration of single-cell RNA sequencing (scRNA-seq) data from multiple experimental batches enables more comprehensive characterizations of cell states. Given that existing methods disregard the structural information between cells and genes, we proposed a structure-preserved scRNA-seq data integration approach using a heterogeneous graph neural network (scHetG). By establishing a heterogeneous graph that represents the interactions between multiple batches of cells and genes, and combining a heterogeneous graph neural network with contrastive learning, scHetG concurrently obtained cell and gene embeddings with structural information. A comprehensive assessment covering different species, tissues and scales indicated that scHetG is an efficacious method for eliminating batch effects while preserving the structural information of cells and genes, including batch-specific cell types and cell-type-specific gene co-expression patterns.
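
As an illustration of what a cell-gene heterogeneous graph can look like, the sketch below builds one from a small count matrix. It rests on assumptions the abstract does not confirm: an edge links a cell to a gene whenever the gene's count reaches a threshold, the edge weight is the raw count, and each cell node carries its batch label. The function name `build_cell_gene_graph`, the `min_count` parameter, and the toy data are hypothetical. The graph neural network and the contrastive learning step are not shown.

```python
def build_cell_gene_graph(expression, cell_ids, gene_ids, batches, min_count=1):
    """Build a bipartite graph with two node types, 'cell' and 'gene'.

    expression: rows are cells, columns are genes (raw counts)
    batches: batch label for each cell, stored as a node attribute
    Returns (nodes, edges), where each edge is (cell_id, gene_id, count).
    """
    nodes = {
        "cell": {cid: {"batch": b} for cid, b in zip(cell_ids, batches)},
        "gene": {gid: {} for gid in gene_ids},
    }
    edges = []
    for cid, row in zip(cell_ids, expression):
        for gid, count in zip(gene_ids, row):
            if count >= min_count:  # assumed rule: connect only expressed genes
                edges.append((cid, gid, count))
    return nodes, edges


# Toy data: three cells from two batches, three genes (made-up counts).
expression = [
    [3, 0, 1],
    [0, 2, 0],
    [0, 0, 5],
]
cell_ids = ["c1", "c2", "c3"]
gene_ids = ["g1", "g2", "g3"]
batches = ["batch1", "batch1", "batch2"]

nodes, edges = build_cell_gene_graph(expression, cell_ids, gene_ids, batches)

# Genes shared across batches connect cells that never co-occur in one batch.
cells_expressing_g3 = sorted(c for c, g, _ in edges if g == "g3")
```

In this toy graph, `g3` links `c1` (batch1) and `c3` (batch2), which shows how shared gene nodes let information flow between batches while each cell keeps its own neighborhood.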

Citations: 0