Cell tracking is an essential function in automated cellular activity monitoring. In practice, processing methods that strike a balance between computational efficiency and accuracy, while generalizing robustly across diverse cell datasets, are highly desired. This paper develops a central-metric fully test-time adaptive framework for cell tracking (CMTT-JTracker). First, a CMTT mechanism is designed for the pre-segmentation of cell images, which enables extracting target information at different resolutions without additional training. Next, a multi-task learning network with a spatial attention scheme is developed to simultaneously realize detection and re-identification tasks based on features extracted by CMTT. Experimental results demonstrate that CMTT-JTracker exhibits remarkable biological and tracking performance compared with benchmark tracking methods. It achieves a multiple object tracking accuracy (MOTA) of 0.894 on Fluo-N2DH-SIM+ and a MOTA of 0.850 on PhC-C2DL-PSC. Experimental results further confirm that CMTT, applied solely as a segmentation unit, outperforms state-of-the-art (SOTA) segmentation benchmarks on various datasets, particularly excelling in scenarios with dense cells. The Dice coefficients of CMTT range from a high of 0.928 to a low of 0.758 across datasets.
Liuyin Chen, Sanyuan Fu, Zijun Zhang. CMTT-JTracker: a fully test-time adaptive framework serving automated cell lineage construction. Briefings in Bioinformatics 25(6), 23 September 2024. doi:10.1093/bib/bbae591. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11570544/pdf/
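The Dice and MOTA figures quoted above can be made concrete with a minimal sketch. This is illustrative only, not the CMTT-JTracker implementation; the toy masks and error counts are invented:

```python
import numpy as np

def dice_coefficient(pred, target):
    """Dice coefficient between two binary masks: 2 * |A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: count as perfect agreement
    return 2.0 * np.logical_and(pred, target).sum() / denom

def mota(fn, fp, idsw, gt):
    """Multiple object tracking accuracy: 1 - (misses + false positives
    + identity switches) / ground-truth objects, summed over frames."""
    return 1.0 - (fn + fp + idsw) / gt

# Toy 4x4 segmentation masks (invented)
pred = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
target = np.array([[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
print(dice_coefficient(pred, target))  # 2*3 / (4+3)
print(mota(fn=5, fp=3, idsw=2, gt=100))
```

A MOTA of 0.894 thus means the combined miss, false-positive, and identity-switch count was about 10.6% of the ground-truth object count.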
The digital annealer (DA) leverages its computational capabilities of up to 100 000 bits to address the complex nondeterministic polynomial-time (NP)-complete challenge inherent in elucidating complex structures of natural products. Conventional computational methods often face limitations with complex mixtures, as they struggle to manage the high dimensionality and intertwined relationships typical in natural products, resulting in inefficiencies and inaccuracies. This study reformulates the challenge into a Quadratic Unconstrained Binary Optimization framework, thereby harnessing the quantum-inspired computing power of the DA. Utilizing mass spectrometry data from three distinct herb species and various potential scaffolds, the DA proficiently locates optimal sidechain combinations that adhere to predefined target molecular weights. This methodology enhances the probability of selecting appropriate sidechains and substituted positions and ensures the generation of solutions within a reasonable 5-min window. The findings underscore the transformative potential of the DA in the realms of analytical chemistry and drug discovery, markedly improving both the precision and practicality of natural product structure elucidation.
Chien Lee, Pei-Hua Wang, Yufeng Jane Tseng. Digital annealing optimization for natural product structure elucidation. Briefings in Bioinformatics 25(6), 23 September 2024. doi:10.1093/bib/bbae600. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11570542/pdf/
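The QUBO reformulation can be sketched in miniature: binary variables select candidate sidechains, and the energy penalizes deviation of the selected mass sum from a target. The masses and target below are invented, and exhaustive search stands in for the digital annealer:

```python
from itertools import product

# Invented candidate sidechain masses (Da) and target residual mass.
masses = [15.0, 31.0, 45.0, 59.0, 17.0]
target = 76.0

def qubo_energy(x, masses, target):
    """QUBO objective: squared deviation of the selected mass sum from
    the target; binary x_i indicates whether sidechain i is used."""
    total = sum(m for m, xi in zip(masses, x) if xi)
    return (total - target) ** 2

# A digital annealer minimizes this energy over binary vectors; for a
# handful of variables, exhaustive search recovers the same optimum.
best = min(product([0, 1], repeat=len(masses)),
           key=lambda x: qubo_energy(x, masses, target))
print(best, qubo_energy(best, masses, target))
```

Expanding the square gives the quadratic-in-binary form a hardware annealer actually consumes; the real problem additionally encodes substitution-position constraints.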
Michael Y Fatemi, Yunrui Lu, Alos B Diallo, Gokul Srinivasan, Zarif L Azher, Brock C Christensen, Lucas A Salas, Gregory J Tsongalis, Scott M Palisoul, Laurent Perreard, Fred W Kolling, Louis J Vaickus, Joshua J Levy
The application of deep learning to spatial transcriptomics (ST) can reveal relationships between gene expression and tissue architecture. Prior work has demonstrated that inferring gene expression from tissue histomorphology can discern these spatial molecular markers to enable population-scale studies, reducing the fiscal barriers associated with large-scale spatial profiling. However, while most improvements in algorithmic performance have focused on improving model architectures, little is known about how the quality of tissue preparation and imaging can affect deep learning model training for spatial inference from morphology and its potential for widespread clinical adoption. Prior studies for ST inference from histology typically utilize manually stained frozen sections with imaging on non-clinical-grade scanners. Training such models on ST cohorts is also costly. We hypothesize that adopting tissue processing and imaging practices that mirror standards for clinical implementation (permanent sections, automated tissue staining, and clinical-grade scanning) can significantly improve model performance. An enhanced specimen processing and imaging protocol was developed for deep learning-based ST inference from morphology. This protocol featured the Visium CytAssist assay to permit automated hematoxylin and eosin staining (e.g. Leica Bond), 40×-resolution imaging, and joining of multiple patients' tissue sections per capture area prior to ST profiling. Using a cohort of 13 pathologic T stage III colorectal cancer patients, we compared the performance of models trained on slides prepared using enhanced versus traditional (i.e. manual staining and low-resolution imaging) protocols. Leveraging Inceptionv3 neural networks, we predicted gene expression across serial, histologically matched tissue sections using whole slide images (WSI) from both protocols.
The data Shapley was used to quantify and compare marginal performance gains on a patient-by-patient basis attributed to using the enhanced protocol versus the actual costs of spatial profiling. Findings indicate that training and validating on WSI acquired through the enhanced protocol as opposed to the traditional method resulted in improved performance at lower fiscal cost. In the realm of ST, the enhancement of deep learning architectures frequently captures the spotlight; however, the significance of specimen processing and imaging is often understated. This research, informed through a game-theoretic lens, underscores the substantial impact that specimen preparation/imaging can have on spatial transcriptomic inference from morphology. It is essential to integrate such optimized processing protocols to facilitate the identification of prognostic markers at a larger scale.
An initial game-theoretic assessment of enhanced tissue preparation and imaging protocols for improved deep learning inference of spatial transcriptomics from tissue morphology. Briefings in Bioinformatics 25(6), 23 September 2024. doi:10.1093/bib/bbae476. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11452536/pdf/
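The data Shapley underlying the per-patient cost-benefit analysis can be sketched for a toy cohort by exact enumeration over arrival orders. The per-patient gains and the additive value function here are invented for illustration; the study's value function is model performance on held-out data:

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution averaged
    over all arrival orders (tractable only for small cohorts)."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for p in order:
            before = value(coalition)
            coalition = coalition | {p}
            phi[p] += value(coalition) - before
    return {p: v / len(orders) for p, v in phi.items()}

# Invented per-patient accuracy gains; with an additive value function,
# each patient's Shapley value equals their individual gain.
gains = {"pt1": 0.05, "pt2": 0.02, "pt3": 0.08}
value = lambda coalition: sum(gains[p] for p in coalition)
phi = shapley_values(list(gains), value)
print(phi)
```

Comparing each patient's Shapley value against their profiling cost is what lets the study argue the enhanced protocol delivers more performance per dollar.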
As a main subtype of post-translational modification (PTM), protein lysine acylations (PLAs) play crucial roles in regulating diverse protein functions. With recent advancements in proteomics technology, PTM identification is becoming a data-rich field, and the large amount of experimentally verified data urgently needs to be translated into valuable biological insights. With computational approaches, PLA can be accurately detected across the whole proteome, even for organisms with small-scale datasets. Herein, a comprehensive summary of 166 in silico PLA prediction methods is presented, covering predictors for both single and multiple types of PLA sites. This recapitulation covers aspects critical for the development of a robust predictor, including data collection and preparation, sample selection, feature representation, classification algorithm design, model evaluation, and method availability. Notably, we discuss the application of protein language models and transfer learning to solve the small-sample learning issue. We also highlight the prediction methods developed for functionally relevant PLA sites and species/substrate/cell-type-specific PLA sites. In conclusion, this systematic review could facilitate the development of novel PLA predictors and offer useful insights to researchers from various disciplines.
Zhaohui Qin, Haoran Ren, Pei Zhao, Kaiyuan Wang, Huixia Liu, Chunbo Miao, Yanxiu Du, Junzhou Li, Liuji Wu, Zhen Chen. Current computational tools for protein lysine acylation site prediction. Briefings in Bioinformatics 25(6), 23 September 2024. doi:10.1093/bib/bbae469. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11421846/pdf/
Accurate and efficient prediction of polymer properties is crucial for polymer design. Recently, data-driven artificial intelligence (AI) models have demonstrated great promise in polymer property analysis. Despite this progress, a pivotal challenge for all AI-driven models remains the effective representation of molecules. Here we introduce Multi-Cover Persistence (MCP)-based molecular representation and featurization for the first time. Our MCP-based polymer descriptors are combined with machine learning models, in particular Gradient Boosting Tree (GBT) models, for polymer property prediction. Different from all previous molecular representations, polymer molecular structure and interactions are represented as MCP, which utilizes Delaunay slices at different dimensions and rhomboid tiling to characterize the complicated geometric and topological information within the data. Statistical features from the generated persistent barcodes are used as polymer descriptors and further combined with the GBT model. Our model has been extensively validated on polymer benchmark datasets. It outperforms traditional fingerprint-based models and achieves accuracy comparable to geometric deep learning models. In particular, our model tends to be more effective on large-sized monomer structures, demonstrating the great potential of MCP in characterizing more complicated polymer data. This work underscores the potential of MCP in polymer informatics, presenting a novel perspective on molecular representation and its application in polymer science.
Yipeng Zhang, Cong Shen, Kelin Xia. Multi-Cover Persistence (MCP)-based machine learning for polymer property prediction. Briefings in Bioinformatics 25(6), 23 September 2024. doi:10.1093/bib/bbae465. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11424509/pdf/
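Turning a persistence barcode into fixed-length statistical descriptors, as described above, might look like the following minimal sketch. The intervals are invented and the actual MCP featurization is richer, but the pattern of summarizing (birth, death) bars into a feature vector is the same:

```python
import statistics

def barcode_features(intervals):
    """Fixed-length summary statistics of a persistence barcode, given as
    (birth, death) intervals, for use as machine learning descriptors."""
    births = [b for b, _ in intervals]
    deaths = [d for _, d in intervals]
    lifespans = [d - b for b, d in intervals]
    return {
        "n_bars": len(intervals),
        "mean_life": statistics.mean(lifespans),
        "max_life": max(lifespans),
        "mean_birth": statistics.mean(births),
        "mean_death": statistics.mean(deaths),
    }

bars = [(0.0, 0.8), (0.1, 0.4), (0.2, 1.5)]  # invented intervals
feats = barcode_features(bars)
print(feats)
```

Because the output length is independent of how many bars the barcode contains, such vectors feed directly into a GBT model regardless of monomer size.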
Zilin Ren, Jiarong Zhang, Yixiang Zhang, Tingting Yang, Pingping Sun, Jiguo Xue, Xiaochen Bo, Bo Zhou, Jiangwei Yan, Ming Ni
Short tandem repeats (STRs) are a type of genetic marker extensively utilized in biomedical and forensic applications. Due to the noise inherent in nanopore sequencing, accurate STR analysis methods are lacking. We developed NASTRA, an innovative tool for Nanopore Autosomal Short Tandem Repeat Analysis, which overcomes the limitations of traditional database-based methods and provides precise germline analysis of STR genetic markers without the need for reference allele sequences. Demonstrating high accuracy in cell line authentication testing and paternity testing, NASTRA significantly surpasses existing methods in both speed and accuracy. This advancement makes it a promising solution for rapid cell line authentication and kinship testing, highlighting the potential of nanopore sequencing for in-field applications.
NASTRA: accurate analysis of short tandem repeat markers by nanopore sequencing with repeat-structure-aware algorithm. Briefings in Bioinformatics 25(6), 23 September 2024. doi:10.1093/bib/bbae472. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11424183/pdf/
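The core task of STR genotyping, counting repeat units of a motif within a read, can be sketched naively. This is illustrative only: NASTRA's repeat-structure-aware algorithm additionally tolerates the indel noise nanopore reads introduce inside the repeat tract, which this exact-match sketch ignores:

```python
import re

def count_repeats(sequence, motif):
    """Longest uninterrupted run of `motif` in a read, in repeat units.
    Naive exact matching; real nanopore callers must also handle
    insertions/deletions within the repeat tract."""
    runs = re.findall(f"(?:{motif})+", sequence)
    return max((len(run) // len(motif) for run in runs), default=0)

read = "GGT" + "AGAT" * 4 + "ACC"  # invented read: 4 copies of AGAT
print(count_repeats(read, "AGAT"))
```

On a real read, a single deletion inside the tract would split the run and undercount, which is exactly the failure mode a repeat-structure-aware method is built to avoid.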
Dimensionality reduction and clustering are crucial tasks in single-cell RNA sequencing (scRNA-seq) data analysis, but they are currently treated independently, which hinders their mutual benefits. The latest methods jointly optimize these tasks through deep clustering. Contrastive learning, with its powerful representation capability, can bridge a gap faced by common deep clustering methods, which require pre-defined cluster centers. Therefore, a dual-level contrastive clustering method with nonuniform sampling (nsDCC) is proposed for scRNA-seq data analysis. Dual-level contrastive clustering, which combines instance-level contrast and cluster-level contrast, jointly optimizes dimensionality reduction and clustering. Multi-positive contrastive learning and a unit matrix constraint are introduced in instance- and cluster-level contrast, respectively. Furthermore, an attention mechanism is introduced to capture inter-cellular information, which is beneficial for clustering. nsDCC focuses on important samples at category boundaries and in minority categories via the proposed nearest boundary sparsest density weight assignment algorithm, making it capable of capturing comprehensive characteristics of imbalanced datasets. Experimental results show that nsDCC outperforms six other state-of-the-art methods on both real and simulated scRNA-seq data, validating its performance on dimensionality reduction and clustering of scRNA-seq data, especially for imbalanced data. Simulation experiments demonstrate that nsDCC is insensitive to "dropout events" in scRNA-seq. Finally, cluster-level differentially expressed gene analysis confirms the meaningfulness of the nsDCC results. In summary, nsDCC is a new way of analyzing and understanding scRNA-seq data.
Linjie Wang, Wei Li, Fanghui Zhou, Kun Yu, Chaolu Feng, Dazhe Zhao. nsDCC: dual-level contrastive clustering with nonuniform sampling for scRNA-seq data analysis. Briefings in Bioinformatics 25(6), 23 September 2024. doi:10.1093/bib/bbae477. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427072/pdf/
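Clustering comparisons like the one above are typically scored with the Adjusted Rand Index. A small self-contained implementation of the standard formula (not nsDCC-specific code; the toy labelings are invented, and sklearn.metrics.adjusted_rand_score computes the same quantity):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings: the Rand index
    corrected for chance agreement, from the contingency table."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    sum_comb = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_comb - expected) / (max_index - expected)

true = [0, 0, 0, 1, 1, 1]  # invented ground-truth cell types
pred = [0, 0, 1, 1, 2, 2]  # invented cluster assignments
print(adjusted_rand_index(true, pred))
```

ARI is 1.0 for a perfect match and near 0 for random labelings, which is what makes it a fair yardstick across methods with different cluster counts.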
Hilbert Yuen In Lam, Jia Sheng Guan, Xing Er Ong, Robbe Pincket, Yuguang Mu
Hitherto, virtual screening (VS) has typically been performed using a structure-based drug design paradigm. Such methods require molecular docking against high-resolution three-dimensional structures of a target protein, a computationally intensive and time-consuming exercise. This work demonstrates that by employing protein language models and molecular graphs as inputs to a novel graph-to-transformer cross-attention mechanism, screening power comparable to state-of-the-art structure-based models can be achieved. The implications include highly expedited VS, due to the greatly reduced compute required to run this model, and the ability to perform early stages of computer-aided drug design in the complete absence of 3D protein structures.
Protein language models are performant in structure-free virtual screening. Briefings in Bioinformatics 25(6), 23 September 2024. doi:10.1093/bib/bbae480. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427677/pdf/
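A minimal sketch of the cross-attention idea, under the assumption that ligand graph node embeddings attend over protein language model residue embeddings; the direction, dimensions, and random inputs are invented for illustration and are not the paper's architecture:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query row attends over
    all key rows and returns a weighted sum of the value rows."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax
    return weights @ values

rng = np.random.default_rng(0)
atoms = rng.normal(size=(5, 16))       # ligand graph node embeddings
residues = rng.normal(size=(120, 16))  # protein LM residue embeddings
contextualized = cross_attention(atoms, residues, residues)
print(contextualized.shape)  # each atom row now carries protein context
```

Because the protein side is an embedding sequence rather than a docked pose, no 3D structure ever enters the computation, which is the source of the speedup claimed above.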
Tianxiang Liu, Cangzhi Jia, Yue Bi, Xudong Guo, Quan Zou, Fuyi Li
Single-cell ribonucleic acid sequencing (scRNA-seq) technology can be used to perform high-resolution analysis of the transcriptomes of individual cells. Therefore, its application has gained popularity for accurately analyzing the ever-increasing content of heterogeneous single-cell datasets. Central to interpreting scRNA-seq data is the clustering of cells to decipher transcriptomic diversity and infer cell behavior patterns. However, its complexity necessitates the application of advanced methodologies capable of resolving the inherent heterogeneity and limited gene expression characteristics of single-cell data. Herein, we introduce a novel deep learning-based algorithm for single-cell clustering, designated scDFN, which can significantly enhance the clustering of scRNA-seq data through a fusion network strategy. The scDFN algorithm applies a dual mechanism involving an autoencoder to extract attribute information and an improved graph autoencoder to capture topological nuances, integrated via a cross-network information fusion mechanism complemented by a triple self-supervision strategy. This fusion is optimized through a holistic consideration of four distinct loss functions. A comparative analysis with five leading scRNA-seq clustering methodologies across multiple datasets revealed the superiority of scDFN, as determined by higher Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) scores. Additionally, scDFN demonstrated robust multi-cluster dataset performance and exceptional resilience to batch effects.
{"title":"scDFN: enhancing single-cell RNA-seq clustering with deep fusion networks.","authors":"Tianxiang Liu, Cangzhi Jia, Yue Bi, Xudong Guo, Quan Zou, Fuyi Li","doi":"10.1093/bib/bbae486","DOIUrl":"10.1093/bib/bbae486","url":null,"abstract":"<p><p>Single-cell ribonucleic acid sequencing (scRNA-seq) technology can be used to perform high-resolution analysis of the transcriptomes of individual cells. Therefore, its application has gained popularity for accurately analyzing the ever-increasing content of heterogeneous single-cell datasets. Central to interpreting scRNA-seq data is the clustering of cells to decipher transcriptomic diversity and infer cell behavior patterns. However, its complexity necessitates the application of advanced methodologies capable of resolving the inherent heterogeneity and limited gene expression characteristics of single-cell data. Herein, we introduce a novel deep learning-based algorithm for single-cell clustering, designated scDFN, which can significantly enhance the clustering of scRNA-seq data through a fusion network strategy. The scDFN algorithm applies a dual mechanism involving an autoencoder to extract attribute information and an improved graph autoencoder to capture topological nuances, integrated via a cross-network information fusion mechanism complemented by a triple self-supervision strategy. This fusion is optimized through a holistic consideration of four distinct loss functions. A comparative analysis with five leading scRNA-seq clustering methodologies across multiple datasets revealed the superiority of scDFN, as determined by better the Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI) metrics. Additionally, scDFN demonstrated robust multi-cluster dataset performance and exceptional resilience to batch effects. 
Ablation studies highlighted the key roles of the autoencoder and the improved graph autoencoder components, along with the critical contribution of the four joint loss functions to the overall efficacy of the algorithm. Through these advancements, scDFN set a new benchmark in single-cell clustering and can be used as an effective tool for the nuanced analysis of single-cell transcriptomics.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11456827/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142380070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
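The dual-view idea behind scDFN, an attribute embedding from an autoencoder fused with a topology embedding from a graph autoencoder, can be sketched with linear toy encoders. This is a schematic under stated assumptions (random weights, a one-layer GCN-style topology view, a simple convex fusion), not the published architecture or its four joint losses.

```python
import numpy as np

def normalize_adj(A):
    """Symmetrically normalize an adjacency matrix with self-loops added."""
    A = A + np.eye(A.shape[0])
    d = A.sum(axis=1)
    D = np.diag(1.0 / np.sqrt(d))
    return D @ A @ D

def fuse_embeddings(X, A, W_ae, W_gae, alpha=0.5):
    """Toy cross-network fusion: an attribute view (autoencoder-style
    linear encoder) blended with a topology view (one GCN layer)."""
    z_attr = np.tanh(X @ W_ae)                       # attribute information
    z_topo = np.tanh(normalize_adj(A) @ X @ W_gae)   # topological information
    return alpha * z_attr + (1 - alpha) * z_topo

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))            # 6 cells x 10 genes, toy expression
A = (rng.random((6, 6)) > 0.6).astype(float)
A = np.triu(A, 1); A = A + A.T          # symmetric cell-cell graph
Z = fuse_embeddings(X, A, rng.normal(size=(10, 4)), rng.normal(size=(10, 4)))
print(Z.shape)  # (6, 4)
```

The fused embedding `Z` is what a downstream clustering head (e.g. a KL-based self-supervision target) would operate on; the convex blend stands in for the paper's learned cross-network fusion.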
The integration of single-cell RNA sequencing (scRNA-seq) data from multiple experimental batches enables more comprehensive characterizations of cell states. Given that existing methods disregard the structural information between cells and genes, we proposed a structure-preserved scRNA-seq data integration approach using a heterogeneous graph neural network (scHetG). By establishing a heterogeneous graph that represents the interactions between multiple batches of cells and genes, and combining a heterogeneous graph neural network with contrastive learning, scHetG concurrently obtained cell and gene embeddings with structural information. A comprehensive assessment covering different species, tissues and scales indicated that scHetG is an efficacious method for eliminating batch effects while preserving the structural information of cells and genes, including batch-specific cell types and cell-type-specific gene co-expression patterns.
{"title":"Structure-preserved integration of scRNA-seq data using heterogeneous graph neural network.","authors":"Xun Zhang, Kun Qian, Hongwei Li","doi":"10.1093/bib/bbae538","DOIUrl":"https://doi.org/10.1093/bib/bbae538","url":null,"abstract":"<p><p>The integration of single-cell RNA sequencing (scRNA-seq) data from multiple experimental batches enables more comprehensive characterizations of cell states. Given that existing methods disregard the structural information between cells and genes, we proposed a structure-preserved scRNA-seq data integration approach using heterogeneous graph neural network (scHetG). By establishing a heterogeneous graph that represents the interactions between multiple batches of cells and genes, and combining a heterogeneous graph neural network with contrastive learning, scHetG concurrently obtained cell and gene embeddings with structural information. A comprehensive assessment covering different species, tissues and scales indicated that scHetG is an efficacious method for eliminating batch effects while preserving the structural information of cells and genes, including batch-specific cell types and cell-type specific gene co-expression patterns.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"25 6","pages":""},"PeriodicalIF":6.8,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11500609/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142495367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
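The core of a heterogeneous cell-gene graph is bipartite message passing: cells aggregate features from the genes they express and genes aggregate from the cells expressing them, yielding embeddings for both node types at once. The sketch below is a minimal one-round version with hypothetical names and random toy data; it omits scHetG's batch handling and contrastive objective.

```python
import numpy as np

def bipartite_message_pass(cell_feats, gene_feats, expr, W_c, W_g):
    """One round of message passing on a cell-gene bipartite graph.
    expr[i, j] weights the edge between cell i and gene j; messages
    are normalized by total incoming edge weight per node."""
    C2G = expr / np.maximum(expr.sum(axis=0, keepdims=True), 1e-9)  # gene <- cells
    G2C = expr / np.maximum(expr.sum(axis=1, keepdims=True), 1e-9)  # cell <- genes
    new_cells = np.tanh(G2C @ gene_feats @ W_c)     # cell embeddings
    new_genes = np.tanh(C2G.T @ cell_feats @ W_g)   # gene embeddings
    return new_cells, new_genes

rng = np.random.default_rng(0)
expr = rng.random((8, 12))          # 8 cells x 12 genes, toy expression weights
cells = rng.normal(size=(8, 5))     # initial cell features
genes = rng.normal(size=(12, 5))    # initial gene features
c_emb, g_emb = bipartite_message_pass(
    cells, genes, expr, rng.normal(size=(5, 3)), rng.normal(size=(5, 3)))
print(c_emb.shape, g_emb.shape)  # (8, 3) (12, 3)
```

Producing embeddings for both cells and genes in the same pass is what lets such a model preserve cell-type-specific gene co-expression structure rather than embedding cells alone.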