2DKD: a toolkit for content-based local image search
Pub Date: 2020-02-10. eCollection Date: 2020-01-01. DOI: 10.1186/s13029-020-0077-1
Julian S DeVille, Daisuke Kihara, Atilla Sit
Background: Direct comparison of 2D images is computationally inefficient due to the need for translation, rotation, and scaling of the images to evaluate their similarity. In many biological applications, such as digital pathology and cryo-EM, identifying specific local regions of images is often of particular interest. Therefore, finding invariant descriptors that can efficiently retrieve local image patches or subimages becomes necessary.
Results: We present a software package called Two-Dimensional Krawtchouk Descriptors (2DKD) that allows users to perform local subimage search in 2D images. The new toolkit uses only a small number of invariant descriptors per image for efficient local image retrieval. This enables querying an image and comparing similar patterns locally across a potentially large database. We show that these descriptors are useful for searching local patterns or small particles in images and demonstrate test cases that can be helpful for both assembly software developers and their users.
Conclusions: Local image comparison and subimage search can prove cumbersome in both computational complexity and runtime, due to factors such as the rotation, scaling, and translation of the object in question. With the 2DKD toolkit, relatively few descriptors are needed to describe a given image, and this can be achieved with minimal memory usage.
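The 2DKD interface itself is not reproduced in this listing. Purely to make the idea concrete (a short vector of invariant numbers per patch, compared by a plain distance), the sketch below uses OpenCV's Hu moment invariants as a stand-in for Krawtchouk descriptors; the function names, window size, and stride are illustrative assumptions, not the 2DKD API.

```python
# Conceptual sketch of descriptor-based local patch retrieval.
# Hu moment invariants stand in for Krawtchouk descriptors; not 2DKD code.
import cv2
import numpy as np

def patch_descriptor(patch):
    """Rotation-, translation-, and scale-invariant descriptor of a 2D patch."""
    m = cv2.moments(patch.astype(np.float32))
    hu = cv2.HuMoments(m).flatten()
    # Log-scale for numerical stability, preserving sign.
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def best_match(query_patch, image, size, stride=8):
    """Slide a window of side `size` over `image` and return the top-left
    corner of the subimage whose descriptor is closest to the query's."""
    q = patch_descriptor(query_patch)
    best_dist, best_pos = np.inf, None
    rows, cols = image.shape
    for y in range(0, rows - size + 1, stride):
        for x in range(0, cols - size + 1, stride):
            d = np.linalg.norm(patch_descriptor(image[y:y + size, x:x + size]) - q)
            if d < best_dist:
                best_dist, best_pos = d, (y, x)
    return best_pos, best_dist
```

For a large database one would precompute such descriptors for the stored regions rather than rescanning images at query time, which is the kind of efficiency the abstract refers to.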
{"title":"2DKD: a toolkit for content-based local image search.","authors":"Julian S DeVille, Daisuke Kihara, Atilla Sit","doi":"10.1186/s13029-020-0077-1","DOIUrl":"10.1186/s13029-020-0077-1","url":null,"abstract":"<p><strong>Background: </strong>Direct comparison of 2D images is computationally inefficient due to the need for translation, rotation, and scaling of the images to evaluate their similarity. In many biological applications, such as digital pathology and cryo-EM, often identifying specific local regions of images is of particular interest. Therefore, finding invariant descriptors that can efficiently retrieve local image patches or subimages becomes necessary.</p><p><strong>Results: </strong>We present a software package called Two-Dimensional Krawtchouk Descriptors that allows to perform local subimage search in 2D images. The new toolkit uses only a small number of invariant descriptors per image for efficient local image retrieval. This enables querying an image and comparing similar patterns locally across a potentially large database. We show that these descriptors appear to be useful for searching local patterns or small particles in images and demonstrate some test cases that can be helpful for both assembly software developers and their users.</p><p><strong>Conclusions: </strong>Local image comparison and subimage search can prove cumbersome in both computational complexity and runtime, due to factors such as the rotation, scaling, and translation of the object in question. By using the 2DKD toolkit, relatively few descriptors are developed to describe a given image, and this can be achieved with minimal memory usage.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"15 ","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2020-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7011505/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37649148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computing and graphing probability values of Pearson distributions: a SAS/IML macro
Pub Date: 2019-12-20. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0076-2
Qing Yang, Xinming An, Wei Pan
Background: Any empirical data can be approximated by one of the Pearson distributions using the first four moments of the data (Elderton WP, Johnson NL. Systems of Frequency Curves. 1969; Pearson K. Philos Trans R Soc Lond Ser A. 186:343-414 1895; Solomon H, Stephens MA. J Am Stat Assoc. 73(361):153-60 1978). Pearson distributions therefore make statistical analysis possible for data with unknown distributions. There are both extant, old-fashioned in-print tables (Pearson ES, Hartley HO. Biometrika Tables for Statisticians, vol. II. 1972) and contemporary computer programs (Amos DE, Daniel SL. Tables of percentage points of standardized Pearson distributions. 1971; Bouver H, Bargmann RE. Tables of the standardized percentage points of the Pearson system of curves in terms of β1 and β2. 1974; Bowman KO, Shenton LR. Biometrika. 66(1):147-51 1979; Davis CS, Stephens MA. Appl Stat. 32(3):322-7 1983; Pan W. J Stat Softw. 31(Code Snippet 2):1-6 2009) available for obtaining percentage points of Pearson distributions corresponding to certain pre-specified percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%, etc.). However, these are of little use in statistical analysis because one must rely on unwieldy second-difference interpolation to calculate the probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing.
Results: The present study develops a SAS/IML macro program that identifies the appropriate type of Pearson distribution, based on either an input dataset or the values of the first four moments, and then computes and graphs probability values of Pearson distributions for any given percentage points.
Conclusions: The SAS macro program returns accurate approximations to Pearson distributions and can efficiently help researchers conduct statistical analysis on data with unknown distributions.
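The macro itself is written in SAS/IML and is not shown here. As a rough illustration of the quantities involved, the Python sketch below computes the first four moments of a sample and the Pearson-system shape parameters β1 (squared skewness) and β2 (kurtosis), which determine the distribution type; the gamma sample is only a placeholder for real data.

```python
# Illustrative only: first four moments and the Pearson-system shape
# parameters beta1 and beta2 from a data sample. Not the SAS/IML macro.
import numpy as np

def pearson_shape(x):
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    m2 = ((x - mean) ** 2).mean()   # second central moment (variance)
    m3 = ((x - mean) ** 3).mean()   # third central moment
    m4 = ((x - mean) ** 4).mean()   # fourth central moment
    beta1 = m3 ** 2 / m2 ** 3       # squared skewness
    beta2 = m4 / m2 ** 2            # (non-excess) kurtosis
    return mean, m2, beta1, beta2

rng = np.random.default_rng(0)
print(pearson_shape(rng.gamma(shape=2.0, size=10_000)))
```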
{"title":"Computing and graphing probability values of pearson distributions: a SAS/IML macro.","authors":"Qing Yang, Xinming An, Wei Pan","doi":"10.1186/s13029-019-0076-2","DOIUrl":"https://doi.org/10.1186/s13029-019-0076-2","url":null,"abstract":"<p><strong>Background: </strong>Any empirical data can be approximated to one of Pearson distributions using the first four moments of the data (Elderton WP, Johnson NL. Systems of Frequency Curves. 1969; Pearson K. Philos Trans R Soc Lond Ser A. 186:343-414 1895; Solomon H, Stephens MA. J Am Stat Assoc. 73(361):153-60 1978). Thus, Pearson distributions made statistical analysis possible for data with unknown distributions. There are both extant, old-fashioned in-print tables (Pearson ES, Hartley HO. Biometrika Tables for Statisticians, vol. II. 1972) and contemporary computer programs (Amos DE, Daniel SL. Tables of percentage points of standardized pearson distributions. 1971; Bouver H, Bargmann RE. Tables of the standardized percentage points of the pearson system of curves in terms of <i>β</i> <sub>1</sub> and <i>β</i> <sub>2</sub>. 1974; Bowman KO, Shenton LR. Biometrika. 66(1):147-51 1979; Davis CS, Stephens MA. Appl Stat. 32(3):322-7 1983; Pan W. J Stat Softw. 31(Code Snippet 2):1-6 2009) available for obtaining percentage points of Pearson distributions corresponding to certain <i>pre-specified</i> percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%, etc.), but they are little useful in statistical analysis because we have to rely on unwieldy second difference interpolation to calculate a probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing.</p><p><strong>Results: </strong>The present study develops a SAS/IML macro program to identify the appropriate type of Pearson distribution based on either input of dataset or the values of four moments and then compute and graph probability values of Pearson distributions for <i>any</i> given percentage points.</p><p><strong>Conclusions: </strong>The SAS macro program returns accurate approximations to Pearson distributions and can efficiently facilitate researchers to conduct statistical analysis on data with unknown distributions.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 ","pages":"6"},"PeriodicalIF":0.0,"publicationDate":"2019-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-019-0076-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37503171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iPBAvizu: a PyMOL plugin for an efficient 3D protein structure superimposition approach
Pub Date: 2019-11-02. DOI: 10.1186/s13029-019-0075-3
Guilhem Faure, A. Joseph, Pierrick Craveur, T. Narwani, N. Srinivasan, Jean-Christophe Gelly, Joseph Rebehmed, A. D. de Brevern
{"title":"iPBAvizu: a PyMOL plugin for an efficient 3D protein structure superimposition approach","authors":"Guilhem Faure, A. Joseph, Pierrick Craveur, T. Narwani, N. Srinivasan, Jean-Christophe Gelly, Joseph Rebehmed, A. D. de Brevern","doi":"10.1186/s13029-019-0075-3","DOIUrl":"https://doi.org/10.1186/s13029-019-0075-3","url":null,"abstract":"","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-019-0075-3","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45661230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social support for collaboration and group awareness in life science research teams
Pub Date: 2019-07-08. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0074-4
Delfina Malandrino, Ilaria Manno, Alberto Negro, Andrea Petta, Luigi Serra, Concita Cantarella, Vittorio Scarano
Background: Next-generation sequencing (NGS) technologies have radically reshaped the landscape of '-omics' research. They produce a plethora of information requiring specific knowledge in sample preparation, analysis, and characterization. Additional expertise and competencies are required when using bioinformatics tools and methods for efficient analysis, interpretation, and visualization of the data. These skills are rarely covered in a single laboratory. More often, the samples are isolated and purified in one laboratory, sequencing is performed by a private company or a specialized lab, and the produced data are analyzed by a third group of researchers. In this scenario, support, communication, and information sharing among researchers are the key points for building common knowledge and meeting the project objectives.
Results: We present ElGalaxy, a system designed and developed to support collaboration and information sharing among researchers. Specifically, we integrated collaborative functionalities into an application commonly adopted by life science researchers. ElGalaxy is therefore the result of integrating Galaxy, a workflow management system, with Elgg, a social network engine.
Conclusions: ElGalaxy enables scientists who work on the same experiment to collaborate and share information, to discuss methods, and to evaluate the results of individual steps, as well as of entire activities, performed during their experiments. ElGalaxy also fosters greater team awareness, especially when experiments are carried out by researchers who belong to different, geographically distributed research centers.
{"title":"Social support for collaboration and group awareness in life science research teams.","authors":"Delfina Malandrino, Ilaria Manno, Alberto Negro, Andrea Petta, Luigi Serra, Concita Cantarella, Vittorio Scarano","doi":"10.1186/s13029-019-0074-4","DOIUrl":"10.1186/s13029-019-0074-4","url":null,"abstract":"<p><strong>Background: </strong>Next-generation sequencing (NGS) technologies have revolutionarily reshaped the landscape of '-omics' research areas. They produce a plethora of information requiring specific knowledge in sample preparation, analysis and characterization. Additionally, expertise and competencies are required when using bioinformatics tools and methods for efficient analysis, interpretation, and visualization of data. These skills are rarely covered in a single laboratory. More often the samples are isolated and purified in a first laboratory, sequencing is performed by a private company or a specialized lab, while the produced data are analyzed by a third group of researchers. In this scenario, the support, the communication, and the information sharing among researchers represent the key points to build a common knowledge and to meet the project objectives.</p><p><strong>Results: </strong>We present ElGalaxy, a system designed and developed to support collaboration and information sharing among researchers. Specifically, we integrated collaborative functionalities within an application usually adopted by Life Science researchers. ElGalaxy, therefore, is the result of the integration of Galaxy, i.e., a Workflow Management System, with Elgg, i.e., a Social Network Engine.</p><p><strong>Conclusions: </strong>ElGalaxy enables scientists, that work on the same experiment, to collaborate and share information, to discuss about methods, and to evaluate results of the individual steps, as well as of entire activities, performed during their experiments. ElGalaxy also allows a greater team awareness, especially when experiments are carried out with researchers which belong to different and distributed research centers.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":" ","pages":"4"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6615102/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46694800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MZPAQ: a FASTQ data compression tool
Pub Date: 2019-06-03. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0073-5
Achraf El Allali, Mariam Arshad
Background: Due to technological progress in Next Generation Sequencing (NGS), the amount of genomic data produced daily has increased tremendously. This increase has shifted the bottleneck of genomic projects from sequencing to computation, specifically to storing, managing, and analyzing the large amount of NGS data. Compression tools can reduce the physical storage used to save large amounts of genomic data as well as the bandwidth used to transfer these data. Recently, DNA sequence compression has gained much attention among researchers.
Results: In this paper, we study different techniques and algorithms used to compress genomic data. Most of these techniques take advantage of properties unique to DNA sequences in order to improve the compression rate, and they usually perform better than general-purpose compressors. By exploring the performance of available algorithms, we produce a powerful compression tool for NGS data called MZPAQ. Results show that MZPAQ outperforms state-of-the-art tools in compression ratio on all benchmark datasets obtained from a recent survey. MZPAQ offers the best compression ratios regardless of the sequencing platform or the size of the data.
Conclusions: Currently, MZPAQ's strength is its higher compression ratio as well as its compatibility with all major sequencing platforms. MZPAQ is most suitable when the size of the compressed data is crucial, such as for long-term storage and data transfer. More efforts will be made in the future to target other aspects such as compression speed and memory utilization.
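MZPAQ's own pipeline is not reproduced here. As a generic illustration of why FASTQ-aware tools can outperform general-purpose compressors, the sketch below splits a FASTQ file into its identifier, sequence, and quality streams and compresses each separately with zlib; stream splitting is a common strategy in this class of tools, but the file name and this code are illustrative assumptions, not MZPAQ.

```python
# Illustrative only: split a FASTQ file into its three logical streams and
# compress each with zlib. Domain-specific tools such as MZPAQ use far more
# sophisticated models; this is not the MZPAQ pipeline.
import zlib

def compress_fastq_streams(path):
    ids, seqs, quals = [], [], []
    with open(path, "rb") as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            ids.append(header)
            seqs.append(fh.readline())
            fh.readline()              # '+' separator line, discarded here
            quals.append(fh.readline())
    streams = {name: b"".join(lines)
               for name, lines in (("id", ids), ("seq", seqs), ("qual", quals))}
    return {name: zlib.compress(data, 9) for name, data in streams.items()}

# Example (hypothetical file): report per-stream compressed sizes.
# sizes = {k: len(v) for k, v in compress_fastq_streams("reads.fastq").items()}
```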
{"title":"MZPAQ: a FASTQ data compression tool.","authors":"Achraf El Allali, Mariam Arshad","doi":"10.1186/s13029-019-0073-5","DOIUrl":"https://doi.org/10.1186/s13029-019-0073-5","url":null,"abstract":"<p><strong>Background: </strong>Due to the technological progress in Next Generation Sequencing (NGS), the amount of genomic data that is produced daily has seen a tremendous increase. This increase has shifted the bottleneck of genomic projects from sequencing to computation and specifically storing, managing and analyzing the large amount of NGS data. Compression tools can reduce the physical storage used to save large amount of genomic data as well as the bandwidth used to transfer this data. Recently, DNA sequence compression has gained much attention among researchers.</p><p><strong>Results: </strong>In this paper, we study different techniques and algorithms used to compress genomic data. Most of these techniques take advantage of some properties that are unique to DNA sequences in order to improve the compression rate, and usually perform better than general-purpose compressors. By exploring the performance of available algorithms, we produce a powerful compression tool for NGS data called MZPAQ. Results show that MZPAQ outperforms state-of-the-art tools on all benchmark datasets obtained from a recent survey in terms of compression ratio. MZPAQ offers the best compression ratios regardless of the sequencing platform or the size of the data.</p><p><strong>Conclusions: </strong>Currently, MZPAQ's strength is its higher compression ratio as well as its compatibility with all major sequencing platforms. MZPAQ is more suitable when the size of compressed data is crucial, such as long-term storage and data transfer. More efforts will be made in the future to target other aspects such as compression speed and memory utilization.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 ","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-019-0073-5","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37308076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IPCAPS: an R package for iterative pruning to capture population structure
Pub Date: 2019-03-20. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0072-6
Kridsadakorn Chaichoompu, Fentaw Abegaz, Sissades Tongsima, Philip James Shaw, Anavaj Sakuntabhai, Luísa Pereira, Kristel Van Steen
Background: Resolving population genetic structure is challenging, especially when dealing with closely related or geographically confined populations. Although Principal Component Analysis (PCA)-based methods and genome-wide single nucleotide polymorphism (SNP) variation are widely used to describe shared genetic ancestry, improvements can be made, especially when fine-scale population structure is the target.
Results: This work presents an R package called IPCAPS, which uses SNP information to resolve possibly fine-scale population structure. The IPCAPS routines are built on the iterative pruning Principal Component Analysis (ipPCA) framework, which systematically assigns individuals to genetically similar subgroups. In each iteration, the tool is able to detect and eliminate outliers, thereby avoiding severe misclassification errors.
Conclusions: IPCAPS supports different measurement scales for variables used to identify substructure. Hence, panels of gene expression and methylation data can be accommodated as well. The tool can also be applied in patient sub-phenotyping contexts. IPCAPS is developed in R and is freely available from http://bio3.giga.ulg.ac.be/ipcaps.
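IPCAPS is an R package, and its stopping rules and outlier handling are not reproduced here. Purely as a conceptual sketch of one iterative-pruning step, the Python fragment below projects samples onto leading principal components and splits them into two candidate subgroups; in the ipPCA framework such splits are applied recursively until no further substructure is detected.

```python
# Conceptual sketch of one PCA-based splitting step on a samples-by-SNPs
# genotype matrix (0/1/2 allele counts). IPCAPS's actual stopping criteria
# and outlier handling are not reproduced here.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def split_once(genotypes):
    """Project samples onto the leading principal components and split them
    into two candidate subgroups; returns an array of 0/1 labels."""
    X = genotypes - genotypes.mean(axis=0)          # center each SNP
    scores = PCA(n_components=2).fit_transform(X)   # leading PCs
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
```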
{"title":"IPCAPS: an R package for iterative pruning to capture population structure.","authors":"Kridsadakorn Chaichoompu, Fentaw Abegaz, Sissades Tongsima, Philip James Shaw, Anavaj Sakuntabhai, Luísa Pereira, Kristel Van Steen","doi":"10.1186/s13029-019-0072-6","DOIUrl":"https://doi.org/10.1186/s13029-019-0072-6","url":null,"abstract":"<p><strong>Background: </strong>Resolving population genetic structure is challenging, especially when dealing with closely related or geographically confined populations. Although Principal Component Analysis (PCA)-based methods and genomic variation with single nucleotide polymorphisms (SNPs) are widely used to describe shared genetic ancestry, improvements can be made especially when fine-scale population structure is the target.</p><p><strong>Results: </strong>This work presents an R package called IPCAPS, which uses SNP information for resolving possibly fine-scale population structure. The IPCAPS routines are built on the iterative pruning Principal Component Analysis (ipPCA) framework that systematically assigns individuals to genetically similar subgroups. In each iteration, our tool is able to detect and eliminate outliers, hereby avoiding severe misclassification errors.</p><p><strong>Conclusions: </strong>IPCAPS supports different measurement scales for variables used to identify substructure. Hence, panels of gene expression and methylation data can be accommodated as well. The tool can also be applied in patient sub-phenotyping contexts. IPCAPS is developed in R and is freely available from http://bio3.giga.ulg.ac.be/ipcaps.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 ","pages":"2"},"PeriodicalIF":0.0,"publicationDate":"2019-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-019-0072-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37111284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
eUTOPIA: solUTion for Omics data PreprocessIng and Analysis
Pub Date: 2019-01-29. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0071-7
Veer Singh Marwah, Giovanni Scala, Pia Anneli Sofia Kinaret, Angela Serra, Harri Alenius, Vittorio Fortino, Dario Greco
Background: The application of microarrays in omics technologies enables quantification of many biomolecules simultaneously. Microarrays are widely applied to observe, by quantitative comparison, positive or negative effects on biomolecule activity in perturbed versus steady-state conditions. Community resources, such as Bioconductor and CRAN, host tools based on the R language that have become standard for high-throughput analytics. However, applying these tools is technically challenging for generic users and requires specific computational skills. There is a need for an intuitive and easy-to-use platform to process omics data and to visualize and interpret the results.
Results: We propose an integrated software solution, eUTOPIA, that implements a set of essential processing steps as a guided workflow presented to the user as an R Shiny application.
Conclusions: eUTOPIA allows researchers to perform preprocessing and analysis of microarray data via a simple and intuitive graphical interface while using state-of-the-art methods.
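eUTOPIA is an R Shiny application wrapping established preprocessing methods, and none of its code is shown here. Only to make the kind of processing concrete, the sketch below implements quantile normalization, one standard microarray preprocessing step, in plain numpy; it is illustrative and is not taken from eUTOPIA.

```python
# Illustrative only: quantile normalization of a probes-by-samples matrix.
# Not eUTOPIA code; shown to make a typical preprocessing step concrete.
import numpy as np

def quantile_normalize(expr):
    """Return the quantile-normalized matrix in which every sample (column)
    shares the same empirical distribution."""
    order = np.argsort(expr, axis=0)              # per-sample ranking
    ranked = np.sort(expr, axis=0)
    mean_quantiles = ranked.mean(axis=1)          # reference distribution
    out = np.empty_like(expr, dtype=float)
    for j in range(expr.shape[1]):
        out[order[:, j], j] = mean_quantiles      # map ranks back to values
    return out
```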
{"title":"eUTOPIA: solUTion for Omics data PreprocessIng and Analysis.","authors":"Veer Singh Marwah, Giovanni Scala, Pia Anneli Sofia Kinaret, Angela Serra, Harri Alenius, Vittorio Fortino, Dario Greco","doi":"10.1186/s13029-019-0071-7","DOIUrl":"10.1186/s13029-019-0071-7","url":null,"abstract":"<p><strong>Background: </strong>Application of microarrays in omics technologies enables quantification of many biomolecules simultaneously. It is widely applied to observe the positive or negative effect on biomolecule activity in perturbed versus the steady state by quantitative comparison. Community resources, such as Bioconductor and CRAN, host tools based on R language that have become standard for high-throughput analytics. However, application of these tools is technically challenging for generic users and require specific computational skills. There is a need for intuitive and easy-to-use platform to process omics data, visualize, and interpret results.</p><p><strong>Results: </strong>We propose an integrated software solution, eUTOPIA, that implements a set of essential processing steps as a guided workflow presented to the user as an R Shiny application.</p><p><strong>Conclusions: </strong>eUTOPIA allows researchers to perform preprocessing and analysis of microarray data via a simple and intuitive graphical interface while using state of the art methods.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 ","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2019-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6352382/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36937294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ProSave: an application for restoring quantitative data to manipulated subsets of protein lists
Pub Date: 2018-11-12. eCollection Date: 2018-01-01. DOI: 10.1186/s13029-018-0070-0
Daniel A Machlab, Gabriel Velez, Alexander G Bassuk, Vinit B Mahajan
Background: In proteomics studies, liquid chromatography tandem mass spectrometry (LC-MS/MS) data are quantified by spectral counts or by some measure of ion abundance. Downstream comparative analysis of protein content (e.g. Venn diagrams and network analysis) typically does not include this quantitative data, and critical information is often lost. To avoid loss of spectral count data in comparative proteomic analyses, it is critical to implement a tool that can rapidly retrieve this information.
Results: We developed ProSave, a free and user-friendly Java-based program that retrieves spectral count data for a curated list of proteins from a large proteomics dataset. ProSave allows for the management of LC-MS/MS datasets and rapidly retrieves spectral count information for a desired list of proteins.
Conclusions: ProSave is open source and freely available at https://github.com/MahajanLab/ProSave. The user manual, implementation notes, and description of methodology and examples are available on the site.
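ProSave itself is a Java program; its interface is not reproduced here. As a rough illustration of the underlying operation (re-attaching spectral counts to a curated subset of proteins), the fragment below filters a full results table by a list of protein IDs; the column names and file name are hypothetical.

```python
# Illustrative only: re-attach spectral counts from a full LC-MS/MS results
# table to a curated subset of protein IDs. Column names ("protein_id",
# "spectral_count") are hypothetical; ProSave itself is a Java application.
import pandas as pd

def restore_counts(full_table_csv, curated_ids):
    counts = pd.read_csv(full_table_csv)                      # full dataset
    subset = counts[counts["protein_id"].isin(curated_ids)]   # curated list only
    return subset[["protein_id", "spectral_count"]]

# Example (hypothetical file and IDs):
# restore_counts("lcmsms_counts.csv", {"P04637", "P38398"})
```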
{"title":"ProSave: an application for restoring quantitative data to manipulated subsets of protein lists.","authors":"Daniel A Machlab, Gabriel Velez, Alexander G Bassuk, Vinit B Mahajan","doi":"10.1186/s13029-018-0070-0","DOIUrl":"https://doi.org/10.1186/s13029-018-0070-0","url":null,"abstract":"<p><strong>Background: </strong>In proteomics studies, liquid chromatography tandem mass spectrometry data (LC-MS/MS) is quantified by spectral counts or by some measure of ion abundance. Downstream comparative analysis of protein content (e.g. Venn diagrams and network analysis) typically does not include this quantitative data and critical information is often lost. To avoid loss of spectral count data in comparative proteomic analyses, it is critical to implement a tool that can rapidly retrieve this information.</p><p><strong>Results: </strong>We developed ProSave, a free and user-friendly Java-based program that retrieves spectral count data from a curated list of proteins in a large proteomics dataset. ProSave allows for the management of LC-MS/MS datasets and rapidly retrieves spectral count information for a desired list of proteins.</p><p><strong>Conclusions: </strong>ProSave is open source and freely available at https://github.com/MahajanLab/ProSave. The user manual, implementation notes, and description of methodology and examples are available on the site.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"13 ","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2018-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-018-0070-0","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36691465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simulating pedigrees ascertained for multiple disease-affected relatives
Pub Date: 2018-10-15. eCollection Date: 2018-01-01. DOI: 10.1186/s13029-018-0069-6
Christina Nieuwoudt, Samantha J Jones, Angela Brooks-Wilson, Jinko Graham
Background: Studies that ascertain families containing multiple relatives affected by disease can be useful for identification of causal, rare variants from next-generation sequencing data.
Results: We present the R package SimRVPedigree, which allows researchers to simulate pedigrees ascertained on the basis of multiple, affected relatives. By incorporating the ascertainment process in the simulation, SimRVPedigree allows researchers to better understand the within-family patterns of relationship amongst affected individuals and ages of disease onset.
Conclusions: Through simulation, we show that affected members of a family segregating a rare disease variant tend to be more numerous and cluster in relationships more closely than those for sporadic disease. We also show that the family ascertainment process can lead to apparent anticipation in the age of onset. Finally, we use simulation to gain insight into the limit on the proportion of ascertained families segregating a causal variant. SimRVPedigree should be useful to investigators seeking insight into the family-based study design through simulation.
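SimRVPedigree simulates full pedigrees with disease-onset and ascertainment models in R; none of that is reproduced here. The toy Python sketch below only illustrates the ascertainment idea from the conclusions: simulate sibships in which a rare risk variant may or may not segregate, keep the families with at least two affected members, and observe what proportion of the ascertained families actually carry the variant. All rates are invented for illustration.

```python
# Toy sketch of ascertainment on multiple affected relatives; not the
# SimRVPedigree model. All probabilities below are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)

def simulate_family(n_sibs=4, carrier_family_prob=0.01,
                    penetrance=0.4, sporadic_rate=0.05):
    carrier_family = rng.random() < carrier_family_prob
    if carrier_family:
        carrier = rng.random(n_sibs) < 0.5          # each sib inherits w.p. 1/2
        affected = rng.random(n_sibs) < np.where(carrier, penetrance, sporadic_rate)
    else:
        affected = rng.random(n_sibs) < sporadic_rate
    return carrier_family, affected.sum()

families = [simulate_family() for _ in range(200_000)]
ascertained = [(c, k) for c, k in families if k >= 2]      # >= 2 affected sibs
prop_carrier = np.mean([c for c, _ in ascertained])
print(f"Proportion of ascertained families carrying the variant: {prop_carrier:.2f}")
```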
{"title":"Simulating pedigrees ascertained for multiple disease-affected relatives.","authors":"Christina Nieuwoudt, Samantha J Jones, Angela Brooks-Wilson, Jinko Graham","doi":"10.1186/s13029-018-0069-6","DOIUrl":"https://doi.org/10.1186/s13029-018-0069-6","url":null,"abstract":"<p><strong>Background: </strong>Studies that ascertain families containing multiple relatives affected by disease can be useful for identification of causal, rare variants from next-generation sequencing data.</p><p><strong>Results: </strong>We present the R package SimRVPedigree, which allows researchers to simulate pedigrees ascertained on the basis of multiple, affected relatives. By incorporating the ascertainment process in the simulation, SimRVPedigree allows researchers to better understand the within-family patterns of relationship amongst affected individuals and ages of disease onset.</p><p><strong>Conclusions: </strong>Through simulation, we show that affected members of a family segregating a rare disease variant tend to be more numerous and cluster in relationships more closely than those for sporadic disease. We also show that the family ascertainment process can lead to apparent anticipation in the age of onset. Finally, we use simulation to gain insight into the limit on the proportion of ascertained families segregating a causal variant. SimRVPedigree should be useful to investigators seeking insight into the family-based study design through simulation.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"13 ","pages":"2"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-018-0069-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36614519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SOV_refine: A further refined definition of segment overlap score and its significance for protein structure similarity
Pub Date: 2018-04-20. eCollection Date: 2018-01-01. DOI: 10.1186/s13029-018-0068-7
Tong Liu, Zheng Wang
Background: The segment overlap score (SOV) has been used to evaluate predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing them with the native or reference secondary structures, another sequence of H, E, and C. SOV's advantage is that it considers the size of continuous overlapping segments and assigns extra allowance to longer continuous overlapping segments, instead of judging only from the percentage of overlapping individual positions, as the Q3 score does. However, we have found a drawback in its previous definition: it cannot ensure that the assigned allowance increases when more residues in a segment are predicted accurately.
Results: We designed a new way of assigning allowance that keeps all the advantages of the previous SOV score definitions and ensures that the amount of allowance assigned increases when more elements in a segment are predicted accurately. Furthermore, our improved SOV achieves a higher correlation with the quality of protein models measured by the GDT-TS score and TM-score, indicating a better ability to evaluate tertiary structure quality at the secondary structure level. We analyzed the statistical significance of SOV scores and found threshold values for distinguishing two protein structures (SOV_refine > 0.19) and for indicating whether two proteins share the same CATH fold (SOV_refine > 0.94 and > 0.90 for three- and eight-state secondary structures, respectively). We provide two further example applications: using the score as a machine learning feature for protein model quality assessment, and comparing different definitions of topologically associating domains. In both cases, our newly defined SOV score resulted in better performance.
Conclusions: The SOV score can be widely used in bioinformatics research and in other fields that need to compare two sequences of letters in which continuous segments have important meanings. We also generalized the previous SOV definitions so that they work for sequences composed of more than three states (e.g., the eight-state definition of protein secondary structures). A standalone software package has been implemented in Perl, with source code released. The software can be downloaded from http://dna.cs.miami.edu/SOV/.
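For readers unfamiliar with the score being refined, the sketch below implements the classic SOV'99 definition, in which each reference segment's contribution is weighted by (minov + delta)/maxov; it does not include the refined allowance introduced in this paper, and the example strings are arbitrary.

```python
# Sketch of the classic SOV'99 segment overlap score (not SOV_refine).
# Inputs are secondary-structure strings over the states, e.g. "HHHEEECCC".

def segments(ss, state):
    """Yield (start, end) index pairs (inclusive) of runs of `state` in ss."""
    start = None
    for i, c in enumerate(ss):
        if c == state and start is None:
            start = i
        elif c != state and start is not None:
            yield (start, i - 1)
            start = None
    if start is not None:
        yield (start, len(ss) - 1)

def sov99(reference, predicted, states="HEC"):
    total, n = 0.0, 0
    for state in states:
        ref_segs = list(segments(reference, state))
        pred_segs = list(segments(predicted, state))
        for s1 in ref_segs:
            len1 = s1[1] - s1[0] + 1
            overlapping = [s2 for s2 in pred_segs
                           if s2[0] <= s1[1] and s1[0] <= s2[1]]
            if not overlapping:
                n += len1                      # counted in N, adds no score
                continue
            for s2 in overlapping:
                len2 = s2[1] - s2[0] + 1
                minov = min(s1[1], s2[1]) - max(s1[0], s2[0]) + 1
                maxov = max(s1[1], s2[1]) - min(s1[0], s2[0]) + 1
                delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
                total += (minov + delta) / maxov * len1
                n += len1
    return 100.0 * total / n if n else 0.0

# Example: sov99("CCHHHHHCCEEEECC", "CCHHHHCCCEEEECC")
```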
{"title":"SOV_refine: A further refined definition of segment overlap score and its significance for protein structure similarity.","authors":"Tong Liu, Zheng Wang","doi":"10.1186/s13029-018-0068-7","DOIUrl":"https://doi.org/10.1186/s13029-018-0068-7","url":null,"abstract":"<p><strong>Background: </strong>The segment overlap score (SOV) has been used to evaluate the predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing it with the native or reference secondary structures, another sequence of H, E, and C. SOV's advantage is that it can consider the size of continuous overlapping segments and assign extra allowance to longer continuous overlapping segments instead of only judging from the percentage of overlapping individual positions as Q3 score does. However, we have found a drawback from its previous definition, that is, it cannot ensure increasing allowance assignment when more residues in a segment are further predicted accurately.</p><p><strong>Results: </strong>A new way of assigning allowance has been designed, which keeps all the advantages of the previous SOV score definitions and ensures that the amount of allowance assigned is incremental when more elements in a segment are predicted accurately. Furthermore, our improved SOV has achieved a higher correlation with the quality of protein models measured by GDT-TS score and TM-score, indicating its better abilities to evaluate tertiary structure quality at the secondary structure level. We analyzed the statistical significance of SOV scores and found the threshold values for distinguishing two protein structures (SOV_refine > 0.19) and indicating whether two proteins are under the same CATH fold (SOV_refine > 0.94 and > 0.90 for three- and eight-state secondary structures respectively). We provided another two example applications, which are when used as a machine learning feature for protein model quality assessment and comparing different definitions of topologically associating domains. We proved that our newly defined SOV score resulted in better performance.</p><p><strong>Conclusions: </strong>The SOV score can be widely used in bioinformatics research and other fields that need to compare two sequences of letters in which continuous segments have important meanings. We also generalized the previous SOV definitions so that it can work for sequences composed of more than three states (e.g., it can work for the eight-state definition of protein secondary structures). A standalone software package has been implemented in Perl with source code released. The software can be downloaded from http://dna.cs.miami.edu/SOV/.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"13 ","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2018-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-018-0068-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36058289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}