What is a consistent glycan composition dataset?

Frontiers in Analytical Science (IF 1.9). Pub Date: 2023-06-07. DOI: 10.3389/frans.2023.1073540

Federico Saba, Julien Mariethoz, F. Lisacek



Abstract

Introduction: One of the main challenges in bioinformatics has been, and still is, the comparison of entities through algorithms for similarity scoring and data clustering according to biologically relevant aspects. Glycoinformatics also faces this challenge, in particular regarding the automated comparison of protein and/or tissue glycomes, which remains relatively uncharted territory.

Methods: Low- and high-throughput experimental glycomic and glycoproteomic results were collected, revealing a bias toward N-linked glycomes. N-glycomes were then represented as networks of related glycan compositions rather than as lists of glycans. They were processed and compared using two Java applications, one generating graphs and the other producing a similarity matrix based on graph content. Several scoring schemes (e.g., Jaccard index or cosine) were tested and evaluated using the Matthews Correlation Coefficient, in order to capture meaningful protein and tissue N-glycome similarity.

Results: Assuming that a glycome corresponds to a well-connected graph of glycan compositions, graph comparison revealed gaps that can be interpreted as inconsistencies. The outcome of systematic graph comparison is both formal and practical. In principle, the idiosyncrasy of current glycome data is shown to limit the definition of appropriate estimates for systematically comparing N-glycomes. Yet several potentially interesting criteria could be identified in a series of use cases detailed in the study.

Discussion: Differentially expressed glycomes are usually compared manually, and the resulting work tends to remain confined to publications because dedicated tools are lacking. Even manual cross-comparison is challenging, mostly because different sets of features are used from one study to the next. The work presented here makes it possible to lay down guidelines for developing a software tool that compares glycomes based on appropriate definitions of similarity and suitable methods for its evaluation and implementation.
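The scoring schemes named in the Methods can be illustrated with a minimal sketch. It is not the authors' Java implementation: the abstract does not say how graph content is turned into scores, so the example assumes each glycome is reduced to a plain set of composition labels. The two glycomes, the function names, and the confusion-matrix counts are hypothetical and chosen only for illustration.

```python
from math import sqrt

# Hypothetical N-glycomes, each reduced to a set of glycan composition labels
# (Hex = hexose, HexNAc = N-acetylhexosamine, Fuc = fucose, NeuAc = sialic acid).
glycome_a = {"Hex5HexNAc2", "Hex5HexNAc4", "Hex5HexNAc4Fuc1", "Hex5HexNAc4NeuAc1"}
glycome_b = {"Hex5HexNAc2", "Hex5HexNAc4", "Hex5HexNAc4Fuc1", "Hex6HexNAc5"}


def jaccard(a: set, b: set) -> float:
    """Jaccard index: shared compositions over all compositions seen."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def cosine(a: set, b: set) -> float:
    """Cosine similarity of the binary presence/absence vectors of a and b."""
    if not a or not b:
        return 0.0
    # For 0/1 vectors the dot product is |a & b| and each norm is sqrt(|set|).
    return len(a & b) / sqrt(len(a) * len(b))


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


print(jaccard(glycome_a, glycome_b))  # 3 shared of 5 distinct compositions
print(cosine(glycome_a, glycome_b))   # 3 / sqrt(4 * 4)
# Assumed evaluation setup: pairs of glycomes labelled "similar" or not,
# with predictions obtained by thresholding a similarity score.
print(mcc(tp=4, tn=4, fp=1, fn=1))
```

One way to use the Matthews Correlation Coefficient in this setting, again as an assumption rather than the paper's documented protocol, is to threshold each similarity score, compare the resulting similar/dissimilar calls with a reference labelling, and prefer the scoring scheme with the higher coefficient.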


