Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf027
Yifeng Fu, Hong Qu, Dacheng Qu, Min Zhao
Motivation: Understanding the dynamics of cell differentiation and development is key to single-cell transcriptome analysis. Current trajectory inference algorithms for cell differentiation face challenges such as high dimensionality, noise, and the need for users to have prior biological knowledge of a dataset to use the algorithms effectively. Here, we introduce Trajectory Inference with Cell-Cell Interactions (TICCI), a novel method that addresses these challenges by integrating intercellular communication information. Recognizing that intercellular communication is crucial during development, TICCI proposes inferring Cell-Cell Interactions (CCI) at single-cell resolution. We posit that cells exhibiting more similar gene expression patterns are more likely to exchange information via biomolecular mediators.
Results: TICCI is initiated by constructing a cell-neighborhood matrix using edge weights composed of intercellular similarity and CCI information. Louvain partitioning identifies trajectory branches, attenuating noise, while single-cell entropy (scEntropy) is used to assess differentiation status. The Chu-Liu algorithm constructs a directed least-square model to identify trajectory branches, and an improved diffusion fitted time algorithm computes cell-fitted time in nonconnected topologies. TICCI validation on single-cell RNA sequencing (scRNA-seq) datasets confirms the accuracy of cell trajectories, aligning with genealogical branching and gene markers. Verification using extrinsic information labels demonstrates CCI information utility in enhancing accurate trajectory inference. A comparative analysis establishes TICCI proficiency in accurate temporal ordering.
Availability and implementation: Source code and binaries are freely available for download at https://github.com/mine41/TICCI. TICCI is implemented in R (version 4.32) and Python (version 3.7.16) and is supported on MS Windows. The authors ensure that the software will remain available for a full two years following publication.
Title: Trajectory Inference with Cell-Cell Interactions (TICCI): intercellular communication improves the accuracy of trajectory inference methods. Journal: Bioinformatics (Oxford, England). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11829803/pdf/
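The Results paragraph of the TICCI entry above names the Chu-Liu algorithm. As a rough illustration only: the Chu-Liu/Edmonds algorithm is generally used to find a minimum-cost spanning arborescence, that is, a directed tree that reaches every node from a chosen root. The plain-Python sketch below computes the cost of such a tree on a toy graph of cell clusters. The function name `min_arborescence_cost`, the node numbering, and the edge weights are hypothetical, and this is not TICCI's implementation; how TICCI derives edge weights from similarity, CCI, and scEntropy is not reproduced here.

```python
def min_arborescence_cost(n, root, edges):
    """Cost of a minimum spanning arborescence (Chu-Liu/Edmonds).

    n     -- number of nodes, labelled 0..n-1 (hypothetical cluster ids)
    root  -- index of the root node (e.g. the least differentiated cluster)
    edges -- list of (source, target, weight) tuples
    Returns the total weight, or None if some node is unreachable.
    """
    total = 0
    while True:
        # Step 1: pick the cheapest incoming edge for every node.
        in_w = [float("inf")] * n
        pre = [-1] * n
        for u, v, w in edges:
            if u != v and w < in_w[v]:
                in_w[v], pre[v] = w, u
        for v in range(n):
            if v != root and in_w[v] == float("inf"):
                return None  # v has no incoming edge at all
        in_w[root] = 0

        # Step 2: add the chosen edges and look for cycles among them.
        comp = [-1] * n      # id of the contracted component per node
        visited = [-1] * n
        cnt = 0
        for v in range(n):
            total += in_w[v]
            x = v
            while visited[x] != v and comp[x] == -1 and x != root:
                visited[x] = v
                x = pre[x]
            if x != root and comp[x] == -1:
                # The walk returned to x, so x lies on a new cycle.
                y = pre[x]
                while y != x:
                    comp[y] = cnt
                    y = pre[y]
                comp[x] = cnt
                cnt += 1
        if cnt == 0:
            return total     # no cycle: the chosen edges form the tree

        # Step 3: contract each cycle into one node and re-weight edges.
        for v in range(n):
            if comp[v] == -1:
                comp[v] = cnt
                cnt += 1
        edges = [(comp[u], comp[v], w - in_w[v])
                 for u, v, w in edges if comp[u] != comp[v]]
        root, n = comp[root], cnt


# Toy example: clusters 1 and 2 point at each other cheaply, so the
# algorithm has to break that cycle by entering it from the root.
toy_edges = [(0, 1, 10), (0, 2, 10), (1, 2, 1), (2, 1, 1), (2, 3, 2)]
print(min_arborescence_cost(4, 0, toy_edges))  # prints 13
```

This version returns only the total cost, which keeps the contraction step short; recovering the selected edges requires extra bookkeeping during contraction.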
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf034
Seokyoung Hong, Krishna Gopal Chattaraj, Jing Guo, Bernhardt L Trout, Richard D Braatz
Motivation: The accurate prediction of O-GlcNAcylation sites is crucial for understanding disease mechanisms and developing effective treatments. Previous machine learning (ML) models primarily relied on primary or secondary protein structural and related properties, which have limitations in capturing the spatial interactions of neighboring amino acids. This study introduces local environmental features as a novel approach that incorporates three-dimensional spatial information, significantly improving model performance by considering the spatial context around the target site. Additionally, we utilize sparse recurrent neural networks to effectively capture the sequential nature of proteins and to identify key factors influencing O-GlcNAcylation, yielding an explainable ML model.
Results: Our findings demonstrate the effectiveness of the proposed features, with the model achieving an F1 score of 28.3%, as well as its feature selection capability: the model using only the top 20% of features achieved the highest F1 score of 32.02%, a 1.4-fold improvement over existing PTM models. Statistical analysis of the top 20 features confirmed their consistency with the literature. This method not only boosts prediction accuracy but also paves the way for further research in understanding and targeting O-GlcNAcylation.
Availability and implementation: All code, data, and features used in this study are available in the GitHub repository: https://github.com/pseokyoung/o-glcnac-prediction.
Title: Enhanced O-glycosylation site prediction using explainable machine learning technique with spatial local environment. Journal: Bioinformatics (Oxford, England). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11814488/pdf/
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf049
Max R Brown, Pablo Manuel Gonzalez de La Rosa, Mark Blaxter
Summary: "tidk" (short for telomere identification toolkit) uses a simple, fast algorithm to scan long DNA reads for the presence of short tandemly repeated DNA in runs, and to aggregate them based on canonical DNA string representation. These are telomeric repeat candidates. Our algorithm is shown to be accurate in genomes for which the telomeric repeat unit is known and is tested across a wide variety of newly assembled genomes to uncover new telomeric repeat units. Tools are provided to identify telomeric repeats de novo, scan genomes for known telomeric repeats, and to visualize telomeric repeats on the assembly. "tidk" is implemented in Rust and is available as a command line tool which can be compiled using the Rust toolchain or downloaded as a binary from bioconda.
Availability and implementation: The "tidk" Rust crate is freely available under the MIT license (https://crates.io/crates/tidk), and the source code is available at https://github.com/tolkit/telomeric-identifier.
Title: tidk: a toolkit to rapidly identify telomeric repeats from genomic datasets. Journal: Bioinformatics (Oxford, England). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11814493/pdf/
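The tidk summary above describes scanning reads for runs of short tandem repeats and aggregating them by a canonical DNA string. A minimal Python sketch of that idea follows. It is not tidk's Rust implementation: the function names are hypothetical, the repeat unit length is fixed by the caller, and the canonical form used here (the lexicographically smallest rotation of the unit or of its reverse complement) is an assumption about what "canonical" could mean.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(unit):
    """Smallest rotation of the unit or of its reverse complement.

    Assumed canonical form: it maps every phase and both strands of a
    repeat (e.g. TTAGGG, GGGTTA, CCCTAA) onto one key.
    """
    candidates = (unit, reverse_complement(unit))
    return min(s[i:] + s[:i] for s in candidates for i in range(len(s)))


def tandem_runs(seq, unit_len, min_copies=3):
    """Yield (start, unit, copies) for runs of back-to-back copies."""
    i = 0
    while i + unit_len <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            yield i, unit, copies
            i = j            # continue after the run
        else:
            i += 1           # slide by one base to try the next phase


def count_candidates(reads, unit_len, min_copies=3):
    """Aggregate repeat copies across reads by canonical unit."""
    counts = Counter()
    for read in reads:
        for _, unit, copies in tandem_runs(read, unit_len, min_copies):
            counts[canonical(unit)] += copies
    return counts


reads = ["TTAGGG" * 4 + "ACGTAC", "CCCTAA" * 3]
print(count_candidates(reads, unit_len=6))  # Counter({'AACCCT': 7})
```

Both toy reads carry the same repeat on opposite strands, so their copies land under a single key; a de novo search would repeat the scan over a range of unit lengths.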
Pub Date : 2025-02-04 DOI: 10.1093/bioinformatics/btaf033
Niels Benjamin Paul, Jonas Chanrithy Wolber, Malte Lennart Sahrhage, Tim Beißbarth, Martin Haubrock
Motivation: Histone modifications play an important role in transcription regulation. Although the general importance of some histone modifications for transcription regulation has been previously established, the relevance of others and their interactions is subject to ongoing research. By training machine learning models to predict a gene's expression and explaining their decision-making process, we can get hints on how histone modifications affect transcription. In previous studies, trained models were either hardly explainable or trained solely on the abundance of histone modifications. Based on other studies, which used histone modification patterns rather than their abundance to identify potential regulatory elements, we hypothesize that the histone modification pattern in a gene's promoter is more predictive of gene expression. We used an optimization algorithm to extract predictive histone modification profiles.
Results: Our algorithm, PatternChrome, achieved an average area under the curve (AUC) score of 0.9029 over 56 samples for binary classification, outperforming all previous algorithms for the same task. We explained the models' decisions to deduce the effect of specific features, certain histone modifications, or promoter positions on transcription regulation. Although the predictive histone modification patterns were extracted for each sample separately, they can be used to predict gene expression in other samples, implying that the created patterns are largely generalizable. Interestingly, the impact of histone modifications on gene regulation appears predominantly indifferent to cellular specificity. Through explanation of the classifier's decisions, we substantiate established literature knowledge while concurrently revealing novel insights into the intricate landscape of transcriptional regulation via histone modification.
Availability and implementation: The code for the PatternChrome algorithm, the scripts for the analyses, and the required data can be found at https://gitlab.gwdg.de/MedBioinf/generegulation/patternchrome.
Title: Prediction of gene expression using histone modification patterns extracted by Particle Swarm Optimization. Journal: Bioinformatics (Oxford, England). Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11802466/pdf/
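The PatternChrome results above are reported as an area under the curve (AUC) for binary classification. For readers less familiar with the metric, here is a small Python sketch of the ROC AUC computed as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the toy labels and scores are illustrative only and are not taken from the paper.

```python
def auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels -- iterable of 0/1 class labels (1 = highly expressed, say)
    scores -- iterable of classifier scores, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # prints 0.75
```

The pairwise form is quadratic in the number of examples, which is fine for a toy illustration; rank-based formulas give the same value more efficiently.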
Summary: MUSET is a novel set of utilities designed to efficiently construct abundance unitig matrices from sequencing data. Unitig matrices extend the concept of k-mer matrices by merging overlapping k-mers that unambiguously belong to the same sequence. MUSET addresses the limitations of current software by integrating k-mer counting and unitig extraction to generate unitig matrices containing abundance values, as opposed to only presence-absence in previous tools. These matrices preserve variations between samples while reducing disk space and the number of rows compared to k-mer matrices. We evaluated MUSET's performance using datasets derived from a 618-GB collection of ancient oral sequencing samples, producing a filtered unitig matrix that records abundances in less than 10 hours and 20 GB memory.
Availability and implementation: MUSET is open source and publicly available under the AGPL-3.0 licence on GitHub at https://github.com/CamilaDuitama/muset. The source code is implemented in C++ and provided with kmat_tools, a collection of tools for processing k-mer matrices. Version v0.5.1 is available on Zenodo with DOI 10.5281/zenodo.14164801.
Supplementary information: Supplementary data are available at Bioinformatics online.
Pub Date : 2025-02-03 DOI: 10.1093/bioinformatics/btaf054
Authors: Riccardo Vicedomini, Francesco Andreace, Yoann Dufresne, Rayan Chikhi, Camila Duitama González
Title: MUSET: Set of utilities for constructing abundance unitig matrices from sequencing data. Journal: Bioinformatics (Oxford, England).
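The MUSET summary above says that unitigs merge overlapping k-mers that unambiguously belong to the same sequence. The Python sketch below illustrates that merging on a toy k-mer set. It is a simplification under stated assumptions, not MUSET's C++ implementation: it works on the forward strand only (no reverse complements), skips k-mers that form an isolated cycle, and summarizes a unitig's abundance in one sample as the mean count of its k-mers, which is an assumed aggregation. All function names are hypothetical.

```python
def successors(kmer, kmers):
    """k-mers in the set that overlap kmer by k-1 bases on the right."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def predecessors(kmer, kmers):
    """k-mers in the set that overlap kmer by k-1 bases on the left."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]


def unitigs(kmers):
    """Maximal non-branching paths of the forward-strand k-mer graph."""
    kmers = set(kmers)

    def is_start(kmer):
        # A unitig starts where the path cannot be extended leftwards
        # unambiguously: no single predecessor, or that predecessor branches.
        preds = predecessors(kmer, kmers)
        return len(preds) != 1 or len(successors(preds[0], kmers)) != 1

    result, seen = [], set()
    for kmer in sorted(kmers):
        if kmer in seen or not is_start(kmer):
            continue
        path, current = kmer, kmer
        seen.add(kmer)
        while True:
            succ = successors(current, kmers)
            if len(succ) != 1:
                break                      # dead end or branch
            nxt = succ[0]
            if nxt in seen or len(predecessors(nxt, kmers)) != 1:
                break                      # merging would be ambiguous
            path += nxt[-1]
            seen.add(nxt)
            current = nxt
        result.append(path)
    return result                          # isolated cycles are not reported


def mean_abundance(unitig, k, counts):
    """Assumed row value: mean count of the unitig's k-mers in one sample."""
    parts = [unitig[i:i + k] for i in range(len(unitig) - k + 1)]
    return sum(counts.get(p, 0) for p in parts) / len(parts)


# k-mers of ACGTA and ACGTC (k = 3): they share ACG and CGT, then diverge.
print(unitigs({"ACG", "CGT", "GTA", "GTC"}))  # ['ACGT', 'GTA', 'GTC']
print(mean_abundance("ACGT", 3, {"ACG": 4, "CGT": 2}))  # 3.0
```

Four k-mer rows collapse into three unitig rows here; on real data the reduction in rows is what makes the unitig matrix smaller than the k-mer matrix.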
Pub Date : 2025-02-03 DOI: 10.1093/bioinformatics/btaf055
Stephen Salerno, Jiacheng Miao, Awan Afiaz, Kentaro Hoffman, Anna Neufeld, Qiongshi Lu, Tyler H McCormick, Jeffrey T Leek
Summary: ipd is an open-source R software package for the downstream modeling of an outcome and its associated features where a potentially sizable portion of the outcome data has been imputed by an artificial intelligence or machine learning (AI/ML) prediction algorithm. The package implements several recently proposed methods for inference on predicted data (IPD) with a single, user-friendly wrapper function, ipd. The package also provides custom print, summary, tidy, glance, and augment methods to facilitate easy model inspection. This document introduces the ipd software package and provides a demonstration of its basic usage.
Availability: ipd is freely available on CRAN or as a developer version at our GitHub page: github.com/ipd-tools/ipd. Full documentation, including detailed instructions and a usage 'vignette', is available at github.com/ipd-tools/ipd.
Title: ipd: An R Package for Conducting Inference on Predicted Data. Journal: Bioinformatics (Oxford, England).
Pub Date : 2025-01-31 DOI: 10.1093/bioinformatics/btaf051
Constance Creux, Farida Zehraoui, François Radvanyi, Fariza Tahi
Motivation: As the biological roles and disease implications of non-coding RNAs continue to emerge, the need to thoroughly characterize previously unexplored non-coding RNAs becomes increasingly urgent. These molecules hold potential as biomarkers and therapeutic targets. However, the vast and complex nature of non-coding RNA data presents a challenge. We introduce MMnc, an interpretable deep learning approach designed to classify non-coding RNAs into functional groups. MMnc leverages multiple data sources (such as sequence, secondary structure, and expression) using attention-based multi-modal data integration. This ensures learning of meaningful representations while accounting for missing sources in some samples.
Results: Our findings demonstrate that MMnc achieves high classification accuracy across diverse non-coding RNA classes. The method's modular architecture allows for the consideration of multiple types of modalities, whereas other tools only consider one or two at most. MMnc is resilient to missing data, ensuring that all available information is effectively utilized. Importantly, the generated attention scores offer interpretable insights into the underlying patterns of the different non-coding RNA classes, potentially driving future non-coding RNA research and applications.
Availability: Data and source code can be found at EvryRNA.ibisc.univ-evry.fr/EvryRNA/MMnc.
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: MMnc: Multi-modal interpretable representation for non-coding RNA classification and class annotation. Journal: Bioinformatics (Oxford, England).
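The MMnc entry above describes attention-based integration of several modalities that tolerates missing sources. One common way to get that behaviour, sketched below in plain Python, is to apply a softmax only over the modalities that are present for a given sample and to combine their embeddings with the resulting weights. This is a generic illustration and not MMnc's architecture: the function name `fuse`, the fixed attention scores, and the two-dimensional embeddings are all hypothetical (in a trained model the scores would be learned).

```python
import math


def fuse(embeddings, scores):
    """Attention-weighted sum over the modalities that are available.

    embeddings -- dict: modality name -> list of floats, or None if missing
    scores     -- dict: modality name -> unnormalized attention score
    Returns (fused_vector, weights); missing modalities get no weight.
    """
    available = [m for m, e in embeddings.items() if e is not None]
    if not available:
        raise ValueError("at least one modality must be present")
    # Softmax restricted to available modalities (shifted for stability).
    top = max(scores[m] for m in available)
    exps = {m: math.exp(scores[m] - top) for m in available}
    norm = sum(exps.values())
    weights = {m: exps[m] / norm for m in available}
    dim = len(embeddings[available[0]])
    fused = [sum(weights[m] * embeddings[m][i] for m in available)
             for i in range(dim)]
    return fused, weights


sample = {"sequence": [1.0, 0.0], "structure": [0.0, 1.0], "expression": None}
fused, weights = fuse(sample, {"sequence": 0.0, "structure": 0.0, "expression": 2.0})
print(fused, weights)
```

Because the missing modality never enters the softmax, the remaining weights still sum to one, and the weights themselves can be inspected per sample.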
Pub Date : 2025-01-31 DOI: 10.1093/bioinformatics/btaf046
Mateusz Staniak, Ting Huang, Amanda M Figueroa-Navedo, Devon Kohler, Meena Choi, Trent Hinkle, Tracy Kleinheinz, Robert Blake, Christopher M Rose, Yingrong Xu, Pierre M Jean Beltran, Liang Xue, Małgorzata Bogdan, Olga Vitek
Motivation: Bottom-up mass spectrometry-based proteomics studies changes in protein abundance and structure across conditions. Since the currency of these experiments is peptides, i.e., subsets of protein sequences that carry the quantitative information, conclusions at a different level must be computationally inferred. The inference is particularly challenging in situations where the peptides are shared by multiple proteins or post-translational modifications. While many approaches infer the underlying abundances from unique peptides, there is a need to distinguish the quantitative patterns when peptides are shared.
Results: We propose a statistical approach for estimating protein abundances, as well as site occupancies of post-translational modifications, based on quantitative information from shared peptides. The approach treats the quantitative patterns of shared peptides as convex combinations of abundances of individual proteins or modification sites, and estimates the abundance of each source in a sample together with the weights of the combination. In simulation-based evaluations, the proposed approach improved the precision of estimated fold changes between conditions. We further demonstrated the practical utility of the approach in experiments with diverse biological objectives, ranging from protein degradation and thermal proteome stability, to changes in protein post-translational modifications.
Availability: The approach is implemented in the open-source R package MSstatsWeightedSummary. The package is currently available at https://github.com/Vitek-Lab/MSstatsWeightedSummary (doi:10.5281/zenodo.14662989). Code required to reproduce the results presented in this article can be found in the repository at https://github.com/mstaniak/MWS_reproduction (doi:10.5281/zenodo.14656053).
Supplementary information: Supplementary data are available at Bioinformatics online.
Title: Relative quantification of proteins and post-translational modifications in proteomic experiments with shared peptides: a weight-based approach. Journal: Bioinformatics (Oxford, England).
Pub Date: 2025-01-31 DOI: 10.1093/bioinformatics/btae754
Mina Namazi, Mohammadali Farahpoor, Erman Ayday, Fernando Pérez-González
Motivation: The affordability of genome sequencing and the widespread availability of genomic data have opened up new medical possibilities. Nevertheless, they also raise significant privacy concerns because of the sensitive information the data contain. These privacy implications act as barriers to medical research and data availability. Researchers have proposed privacy-preserving techniques to address this, with cryptography-based methods showing the most promise. However, existing cryptography-based designs lack (i) interoperability, (ii) scalability, (iii) a high degree of privacy (i.e., they compromise one to have the other), or (iv) support for multiparty analyses (as most existing schemes process the genomic information of each party individually). Overcoming these limitations is essential to unlocking the full potential of genomic data while ensuring privacy and data utility. Further research and development are needed to advance privacy-preserving techniques in genomics, focusing on achieving interoperability and scalability, preserving data utility, and enabling secure multiparty computation.
Results: This study aims to overcome the limitations of current cryptography-based techniques by employing a multi-key homomorphic encryption scheme. Using this scheme, we have developed a comprehensive protocol capable of conducting diverse genomic analyses. Our protocol facilitates interoperability in individual genome processing and enables multiparty tests, analyses of genomic databases, and operations involving multiple databases. Consequently, our approach represents an innovative advancement in secure genomic data processing, offering enhanced protection and privacy measures.
Availability and implementation: All associated code and documentation are available at https://github.com/farahpoor/smkhe.
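To make concrete what computing on encrypted genomic data means, the following is a minimal sketch of an additively homomorphic scheme. It deliberately swaps in textbook single-key Paillier encryption with toy-sized primes, which is not the multi-key scheme the abstract describes and is not taken from the authors' repository. The function names, the primes, and the allele-count scenario are all assumptions made for illustration.

```python
import math
import random

def keygen(p=1789, q=1861):
    """Toy Paillier key generation. The small primes are illustrative only
    and offer no real security."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    # mu is the modular inverse of L(g^lam mod n^2), with L(x) = (x - 1) // n
    mu = pow((pow(g, lam, n * n) - 1) // n, -1, n)
    return (n, g), (lam, mu)

def encrypt(public_key, message):
    n, g = public_key
    n2 = n * n
    # Pick a random r that is invertible modulo n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, message, n2) * pow(r, n, n2)) % n2

def decrypt(public_key, private_key, ciphertext):
    n, _ = public_key
    lam, mu = private_key
    return ((pow(ciphertext, lam, n * n) - 1) // n * mu) % n

def add_encrypted(public_key, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    n, _ = public_key
    return (c1 * c2) % (n * n)

# Hypothetical scenario: three parties each hold a minor-allele count.
public_key, private_key = keygen()
counts = [12, 7, 30]
encrypted_total = encrypt(public_key, 0)
for count in counts:
    encrypted_total = add_encrypted(public_key, encrypted_total, encrypt(public_key, count))
total = decrypt(public_key, private_key, encrypted_total)
```

In this sketch the aggregator only ever handles ciphertexts, and a single key holder decrypts the sum. A multi-key scheme, as in the abstract, would instead let each party encrypt under its own key, which this simplification does not capture.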
Title: Privacy-Preserving Framework for Genomic Computations via Multi-Key Homomorphic Encryption. Journal: Bioinformatics (Oxford, England).
Pub Date: 2025-01-29 DOI: 10.1093/bioinformatics/btaf044
Hua Meng, Chuan Qin, Zhiguo Long
Motivation: The rapid development of single-cell RNA sequencing (scRNA-seq) has significantly advanced biomedical research. Clustering analysis, crucial for scRNA-seq data, faces challenges including data sparsity, high dimensionality, and variable gene expression. Better low-dimensional embeddings for these complex data should maintain intrinsic information while making similar data close and dissimilar data distant. However, existing methods utilizing neural networks typically focus on minimizing reconstruction loss and maintaining similarity in embeddings of directly related cells, but fail to consider dissimilarity, and thus lack separability, which limits clustering performance.
Results: We propose a novel clustering algorithm, called scHNTL (scRNA-seq data clustering augmented by high-order neighbors and triplet loss). It first constructs an auxiliary similarity graph and uses a Graph Attentional Autoencoder to learn initial embeddings of cells. Then it identifies similar and dissimilar cells by exploring high-order structures of the similarity graph and exploits a triplet loss from contrastive learning to improve how well the embeddings preserve structural information by separating dissimilar pairs. Finally, this embedding improvement and the clustering objective are fused in a self-optimizing clustering framework to obtain the clusters. Experimental evaluations on 16 real-world datasets demonstrate the superiority of scHNTL in clustering over state-of-the-art single-cell clustering algorithms.
Availability and implementation: A Python implementation of scHNTL is available at Figshare (https://doi.org/10.6084/m9.figshare.27001090) and GitHub (https://github.com/SWJTU-ML/scHNTL-code).
Supplementary information: Supplementary data are available at Bioinformatics online.
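The idea of choosing similar and dissimilar cells from high-order graph neighborhoods and scoring them with a triplet loss can be sketched roughly as follows. This is a simplified illustration rather than the scHNTL implementation: it assumes that cells within k hops of an anchor count as similar and cells beyond k hops count as dissimilar, and it uses a standard margin-based triplet loss on squared Euclidean distances. The function names, the toy graph, and the embeddings are all hypothetical.

```python
from collections import deque

def within_k_hops(adjacency, source, k):
    """Return the set of nodes reachable from `source` in at most k edges,
    found by breadth-first search."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if hops[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return set(hops)

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss: zero once the negative is farther from the
    anchor than the positive by at least `margin` (in squared distance)."""
    return max(squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + margin, 0.0)

# Toy similarity graph (a path 0-1-2-3) and toy 2-D cell embeddings.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
embeddings = {0: (0.0, 0.0), 1: (0.5, 0.0), 2: (1.0, 0.0), 3: (1.2, 0.0)}

similar = within_k_hops(adjacency, 0, k=2)   # cells treated as similar to cell 0
dissimilar = set(adjacency) - similar        # cells treated as dissimilar
loss = triplet_loss(embeddings[0], embeddings[2], embeddings[3])
```

Here cell 3 lies outside the 2-hop neighborhood of cell 0 but sits close to cell 2 in the embedding, so the loss is positive. Minimizing it would push cell 3 away from the anchor relative to cell 2.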
Title: scHNTL: single-cell RNA-seq data clustering augmented by high-order neighbors and triplet loss. Journal: Bioinformatics (Oxford, England).