2DKD: a toolkit for content-based local image search
Pub Date: 2020-02-10. eCollection Date: 2020-01-01. DOI: 10.1186/s13029-020-0077-1
Julian S DeVille, Daisuke Kihara, Atilla Sit
Background: Direct comparison of 2D images is computationally inefficient due to the need for translation, rotation, and scaling of the images to evaluate their similarity. In many biological applications, such as digital pathology and cryo-EM, identifying specific local regions of images is often of particular interest. Therefore, finding invariant descriptors that can efficiently retrieve local image patches or subimages becomes necessary.
Results: We present a software package called Two-Dimensional Krawtchouk Descriptors (2DKD) that allows users to perform local subimage search in 2D images. The new toolkit uses only a small number of invariant descriptors per image for efficient local image retrieval. This enables querying an image and comparing similar patterns locally across a potentially large database. We show that these descriptors are useful for searching local patterns or small particles in images and demonstrate test cases that can be helpful for both assembly software developers and their users.
Conclusions: Local image comparison and subimage search can prove cumbersome in both computational complexity and runtime, due to factors such as the rotation, scaling, and translation of the object in question. With the 2DKD toolkit, relatively few descriptors are needed to describe a given image, and this can be achieved with minimal memory usage.
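The 2DKD interface itself is not reproduced in this listing. Purely to make the idea concrete (a short vector of invariant numbers per patch, compared by a plain distance), the sketch below uses OpenCV's Hu moment invariants as a stand-in for Krawtchouk descriptors; the function names, window size, and stride are illustrative assumptions, not the 2DKD API.

```python
# Conceptual sketch of descriptor-based local patch retrieval.
# Hu moment invariants stand in for Krawtchouk descriptors; not 2DKD code.
import cv2
import numpy as np

def patch_descriptor(patch):
    """Rotation-, translation-, and scale-invariant descriptor of a 2D patch."""
    m = cv2.moments(patch.astype(np.float32))
    hu = cv2.HuMoments(m).flatten()
    # Log-scale for numerical stability, preserving sign.
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def best_match(query_patch, image, size, stride=8):
    """Slide a window of side `size` over `image` and return the top-left
    corner of the subimage whose descriptor is closest to the query's."""
    q = patch_descriptor(query_patch)
    best_dist, best_pos = np.inf, None
    rows, cols = image.shape
    for y in range(0, rows - size + 1, stride):
        for x in range(0, cols - size + 1, stride):
            d = np.linalg.norm(patch_descriptor(image[y:y + size, x:x + size]) - q)
            if d < best_dist:
                best_dist, best_pos = d, (y, x)
    return best_pos, best_dist
```

For a large database one would precompute such descriptors for the stored regions rather than rescanning images at query time, which is the kind of efficiency the abstract refers to.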
{"title":"2DKD: a toolkit for content-based local image search.","authors":"Julian S DeVille, Daisuke Kihara, Atilla Sit","doi":"10.1186/s13029-020-0077-1","DOIUrl":"10.1186/s13029-020-0077-1","url":null,"abstract":"<p><strong>Background: </strong>Direct comparison of 2D images is computationally inefficient due to the need for translation, rotation, and scaling of the images to evaluate their similarity. In many biological applications, such as digital pathology and cryo-EM, often identifying specific local regions of images is of particular interest. Therefore, finding invariant descriptors that can efficiently retrieve local image patches or subimages becomes necessary.</p><p><strong>Results: </strong>We present a software package called Two-Dimensional Krawtchouk Descriptors that allows to perform local subimage search in 2D images. The new toolkit uses only a small number of invariant descriptors per image for efficient local image retrieval. This enables querying an image and comparing similar patterns locally across a potentially large database. We show that these descriptors appear to be useful for searching local patterns or small particles in images and demonstrate some test cases that can be helpful for both assembly software developers and their users.</p><p><strong>Conclusions: </strong>Local image comparison and subimage search can prove cumbersome in both computational complexity and runtime, due to factors such as the rotation, scaling, and translation of the object in question. By using the 2DKD toolkit, relatively few descriptors are developed to describe a given image, and this can be achieved with minimal memory usage.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"15 ","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2020-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7011505/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37649148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computing and graphing probability values of Pearson distributions: a SAS/IML macro
Pub Date: 2019-12-20. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0076-2
Qing Yang, Xinming An, Wei Pan
Background: Any empirical data can be approximated by one of the Pearson distributions using the first four moments of the data (Elderton WP, Johnson NL. Systems of Frequency Curves. 1969; Pearson K. Philos Trans R Soc Lond Ser A. 186:343-414 1895; Solomon H, Stephens MA. J Am Stat Assoc. 73(361):153-60 1978). Pearson distributions therefore make statistical analysis possible for data with unknown distributions. There are both extant, old-fashioned in-print tables (Pearson ES, Hartley HO. Biometrika Tables for Statisticians, vol. II. 1972) and contemporary computer programs (Amos DE, Daniel SL. Tables of percentage points of standardized Pearson distributions. 1971; Bouver H, Bargmann RE. Tables of the standardized percentage points of the Pearson system of curves in terms of β1 and β2. 1974; Bowman KO, Shenton LR. Biometrika. 66(1):147-51 1979; Davis CS, Stephens MA. Appl Stat. 32(3):322-7 1983; Pan W. J Stat Softw. 31(Code Snippet 2):1-6 2009) available for obtaining percentage points of Pearson distributions corresponding to certain pre-specified percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%, etc.). However, these are of little use in statistical analysis because one must rely on unwieldy second-difference interpolation to calculate the probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing.
Results: The present study develops a SAS/IML macro program that identifies the appropriate type of Pearson distribution, based on either an input dataset or the values of the first four moments, and then computes and graphs probability values of Pearson distributions for any given percentage points.
Conclusions: The SAS macro program returns accurate approximations to Pearson distributions and can efficiently help researchers conduct statistical analysis on data with unknown distributions.
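The macro itself is written in SAS/IML and is not shown here. As a rough illustration of the quantities involved, the Python sketch below computes the first four moments of a sample and the Pearson-system shape parameters β1 (squared skewness) and β2 (kurtosis), which determine the distribution type; the gamma sample is only a placeholder for real data.

```python
# Illustrative only: first four moments and the Pearson-system shape
# parameters beta1 and beta2 from a data sample. Not the SAS/IML macro.
import numpy as np

def pearson_shape(x):
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    m2 = ((x - mean) ** 2).mean()   # second central moment (variance)
    m3 = ((x - mean) ** 3).mean()   # third central moment
    m4 = ((x - mean) ** 4).mean()   # fourth central moment
    beta1 = m3 ** 2 / m2 ** 3       # squared skewness
    beta2 = m4 / m2 ** 2            # (non-excess) kurtosis
    return mean, m2, beta1, beta2

rng = np.random.default_rng(0)
print(pearson_shape(rng.gamma(shape=2.0, size=10_000)))
```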
{"title":"Computing and graphing probability values of pearson distributions: a SAS/IML macro.","authors":"Qing Yang, Xinming An, Wei Pan","doi":"10.1186/s13029-019-0076-2","DOIUrl":"https://doi.org/10.1186/s13029-019-0076-2","url":null,"abstract":"<p><strong>Background: </strong>Any empirical data can be approximated to one of Pearson distributions using the first four moments of the data (Elderton WP, Johnson NL. Systems of Frequency Curves. 1969; Pearson K. Philos Trans R Soc Lond Ser A. 186:343-414 1895; Solomon H, Stephens MA. J Am Stat Assoc. 73(361):153-60 1978). Thus, Pearson distributions made statistical analysis possible for data with unknown distributions. There are both extant, old-fashioned in-print tables (Pearson ES, Hartley HO. Biometrika Tables for Statisticians, vol. II. 1972) and contemporary computer programs (Amos DE, Daniel SL. Tables of percentage points of standardized pearson distributions. 1971; Bouver H, Bargmann RE. Tables of the standardized percentage points of the pearson system of curves in terms of <i>β</i> <sub>1</sub> and <i>β</i> <sub>2</sub>. 1974; Bowman KO, Shenton LR. Biometrika. 66(1):147-51 1979; Davis CS, Stephens MA. Appl Stat. 32(3):322-7 1983; Pan W. J Stat Softw. 31(Code Snippet 2):1-6 2009) available for obtaining percentage points of Pearson distributions corresponding to certain <i>pre-specified</i> percentages (or probability values; e.g., 1.0%, 2.5%, 5.0%, etc.), but they are little useful in statistical analysis because we have to rely on unwieldy second difference interpolation to calculate a probability value of a Pearson distribution corresponding to a given percentage point, such as an observed test statistic in hypothesis testing.</p><p><strong>Results: </strong>The present study develops a SAS/IML macro program to identify the appropriate type of Pearson distribution based on either input of dataset or the values of four moments and then compute and graph probability values of Pearson distributions for <i>any</i> given percentage points.</p><p><strong>Conclusions: </strong>The SAS macro program returns accurate approximations to Pearson distributions and can efficiently facilitate researchers to conduct statistical analysis on data with unknown distributions.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 ","pages":"6"},"PeriodicalIF":0.0,"publicationDate":"2019-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-019-0076-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37503171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iPBAvizu: a PyMOL plugin for an efficient 3D protein structure superimposition approach
Pub Date: 2019-11-02. DOI: 10.1186/s13029-019-0075-3
Guilhem Faure, A. Joseph, Pierrick Craveur, T. Narwani, N. Srinivasan, Jean-Christophe Gelly, Joseph Rebehmed, A. D. de Brevern
{"title":"iPBAvizu: a PyMOL plugin for an efficient 3D protein structure superimposition approach","authors":"Guilhem Faure, A. Joseph, Pierrick Craveur, T. Narwani, N. Srinivasan, Jean-Christophe Gelly, Joseph Rebehmed, A. D. de Brevern","doi":"10.1186/s13029-019-0075-3","DOIUrl":"https://doi.org/10.1186/s13029-019-0075-3","url":null,"abstract":"","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2019-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-019-0075-3","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45661230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social support for collaboration and group awareness in life science research teams
Pub Date: 2019-07-08. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0074-4
Delfina Malandrino, Ilaria Manno, Alberto Negro, Andrea Petta, Luigi Serra, Concita Cantarella, Vittorio Scarano
Background: Next-generation sequencing (NGS) technologies have radically reshaped the landscape of '-omics' research. They produce a plethora of information requiring specific knowledge in sample preparation, analysis, and characterization. Additional expertise and competencies are required when using bioinformatics tools and methods for efficient analysis, interpretation, and visualization of the data. These skills are rarely covered in a single laboratory. More often, the samples are isolated and purified in one laboratory, sequencing is performed by a private company or a specialized lab, and the produced data are analyzed by a third group of researchers. In this scenario, support, communication, and information sharing among researchers are the key points for building common knowledge and meeting the project objectives.
Results: We present ElGalaxy, a system designed and developed to support collaboration and information sharing among researchers. Specifically, we integrated collaborative functionalities into an application commonly adopted by life science researchers. ElGalaxy is therefore the result of integrating Galaxy, a workflow management system, with Elgg, a social network engine.
Conclusions: ElGalaxy enables scientists who work on the same experiment to collaborate and share information, to discuss methods, and to evaluate the results of individual steps, as well as of entire activities, performed during their experiments. ElGalaxy also fosters greater team awareness, especially when experiments are carried out by researchers who belong to different, geographically distributed research centers.
{"title":"Social support for collaboration and group awareness in life science research teams.","authors":"Delfina Malandrino, Ilaria Manno, Alberto Negro, Andrea Petta, Luigi Serra, Concita Cantarella, Vittorio Scarano","doi":"10.1186/s13029-019-0074-4","DOIUrl":"10.1186/s13029-019-0074-4","url":null,"abstract":"<p><strong>Background: </strong>Next-generation sequencing (NGS) technologies have revolutionarily reshaped the landscape of '-omics' research areas. They produce a plethora of information requiring specific knowledge in sample preparation, analysis and characterization. Additionally, expertise and competencies are required when using bioinformatics tools and methods for efficient analysis, interpretation, and visualization of data. These skills are rarely covered in a single laboratory. More often the samples are isolated and purified in a first laboratory, sequencing is performed by a private company or a specialized lab, while the produced data are analyzed by a third group of researchers. In this scenario, the support, the communication, and the information sharing among researchers represent the key points to build a common knowledge and to meet the project objectives.</p><p><strong>Results: </strong>We present ElGalaxy, a system designed and developed to support collaboration and information sharing among researchers. Specifically, we integrated collaborative functionalities within an application usually adopted by Life Science researchers. ElGalaxy, therefore, is the result of the integration of Galaxy, i.e., a Workflow Management System, with Elgg, i.e., a Social Network Engine.</p><p><strong>Conclusions: </strong>ElGalaxy enables scientists, that work on the same experiment, to collaborate and share information, to discuss about methods, and to evaluate results of the individual steps, as well as of entire activities, performed during their experiments. ElGalaxy also allows a greater team awareness, especially when experiments are carried out with researchers which belong to different and distributed research centers.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":" ","pages":"4"},"PeriodicalIF":0.0,"publicationDate":"2019-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6615102/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46694800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MZPAQ: a FASTQ data compression tool
Pub Date: 2019-06-03. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0073-5
Achraf El Allali, Mariam Arshad
Background: Due to technological progress in Next Generation Sequencing (NGS), the amount of genomic data produced daily has increased tremendously. This increase has shifted the bottleneck of genomic projects from sequencing to computation, specifically to storing, managing, and analyzing the large amount of NGS data. Compression tools can reduce the physical storage used to save large amounts of genomic data as well as the bandwidth used to transfer these data. Recently, DNA sequence compression has gained much attention among researchers.
Results: In this paper, we study different techniques and algorithms used to compress genomic data. Most of these techniques take advantage of properties unique to DNA sequences in order to improve the compression rate, and they usually perform better than general-purpose compressors. By exploring the performance of available algorithms, we produce a powerful compression tool for NGS data called MZPAQ. Results show that MZPAQ outperforms state-of-the-art tools in compression ratio on all benchmark datasets obtained from a recent survey. MZPAQ offers the best compression ratios regardless of the sequencing platform or the size of the data.
Conclusions: Currently, MZPAQ's strength is its higher compression ratio as well as its compatibility with all major sequencing platforms. MZPAQ is most suitable when the size of the compressed data is crucial, such as for long-term storage and data transfer. More efforts will be made in the future to target other aspects such as compression speed and memory utilization.
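MZPAQ's own pipeline is not reproduced here. As a generic illustration of why FASTQ-aware tools can outperform general-purpose compressors, the sketch below splits a FASTQ file into its identifier, sequence, and quality streams and compresses each separately with zlib; stream splitting is a common strategy in this class of tools, but the file name and this code are illustrative assumptions, not MZPAQ.

```python
# Illustrative only: split a FASTQ file into its three logical streams and
# compress each with zlib. Domain-specific tools such as MZPAQ use far more
# sophisticated models; this is not the MZPAQ pipeline.
import zlib

def compress_fastq_streams(path):
    ids, seqs, quals = [], [], []
    with open(path, "rb") as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            ids.append(header)
            seqs.append(fh.readline())
            fh.readline()              # '+' separator line, discarded here
            quals.append(fh.readline())
    streams = {name: b"".join(lines)
               for name, lines in (("id", ids), ("seq", seqs), ("qual", quals))}
    return {name: zlib.compress(data, 9) for name, data in streams.items()}

# Example (hypothetical file): report per-stream compressed sizes.
# sizes = {k: len(v) for k, v in compress_fastq_streams("reads.fastq").items()}
```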
{"title":"MZPAQ: a FASTQ data compression tool.","authors":"Achraf El Allali, Mariam Arshad","doi":"10.1186/s13029-019-0073-5","DOIUrl":"https://doi.org/10.1186/s13029-019-0073-5","url":null,"abstract":"<p><strong>Background: </strong>Due to the technological progress in Next Generation Sequencing (NGS), the amount of genomic data that is produced daily has seen a tremendous increase. This increase has shifted the bottleneck of genomic projects from sequencing to computation and specifically storing, managing and analyzing the large amount of NGS data. Compression tools can reduce the physical storage used to save large amount of genomic data as well as the bandwidth used to transfer this data. Recently, DNA sequence compression has gained much attention among researchers.</p><p><strong>Results: </strong>In this paper, we study different techniques and algorithms used to compress genomic data. Most of these techniques take advantage of some properties that are unique to DNA sequences in order to improve the compression rate, and usually perform better than general-purpose compressors. By exploring the performance of available algorithms, we produce a powerful compression tool for NGS data called MZPAQ. Results show that MZPAQ outperforms state-of-the-art tools on all benchmark datasets obtained from a recent survey in terms of compression ratio. MZPAQ offers the best compression ratios regardless of the sequencing platform or the size of the data.</p><p><strong>Conclusions: </strong>Currently, MZPAQ's strength is its higher compression ratio as well as its compatibility with all major sequencing platforms. MZPAQ is more suitable when the size of compressed data is crucial, such as long-term storage and data transfer. More efforts will be made in the future to target other aspects such as compression speed and memory utilization.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 ","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2019-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-019-0073-5","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37308076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IPCAPS: an R package for iterative pruning to capture population structure
Pub Date: 2019-03-20. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0072-6
Kridsadakorn Chaichoompu, Fentaw Abegaz, Sissades Tongsima, Philip James Shaw, Anavaj Sakuntabhai, Luísa Pereira, Kristel Van Steen
Background: Resolving population genetic structure is challenging, especially when dealing with closely related or geographically confined populations. Although Principal Component Analysis (PCA)-based methods and genome-wide single nucleotide polymorphism (SNP) variation are widely used to describe shared genetic ancestry, improvements can be made, especially when fine-scale population structure is the target.
Results: This work presents an R package called IPCAPS, which uses SNP information to resolve possibly fine-scale population structure. The IPCAPS routines are built on the iterative pruning Principal Component Analysis (ipPCA) framework, which systematically assigns individuals to genetically similar subgroups. In each iteration, the tool is able to detect and eliminate outliers, thereby avoiding severe misclassification errors.
Conclusions: IPCAPS supports different measurement scales for variables used to identify substructure. Hence, panels of gene expression and methylation data can be accommodated as well. The tool can also be applied in patient sub-phenotyping contexts. IPCAPS is developed in R and is freely available from http://bio3.giga.ulg.ac.be/ipcaps.
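IPCAPS is an R package, and its stopping rules and outlier handling are not reproduced here. Purely as a conceptual sketch of one iterative-pruning step, the Python fragment below projects samples onto leading principal components and splits them into two candidate subgroups; in the ipPCA framework such splits are applied recursively until no further substructure is detected.

```python
# Conceptual sketch of one PCA-based splitting step on a samples-by-SNPs
# genotype matrix (0/1/2 allele counts). IPCAPS's actual stopping criteria
# and outlier handling are not reproduced here.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def split_once(genotypes):
    """Project samples onto the leading principal components and split them
    into two candidate subgroups; returns an array of 0/1 labels."""
    X = genotypes - genotypes.mean(axis=0)          # center each SNP
    scores = PCA(n_components=2).fit_transform(X)   # leading PCs
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
```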
{"title":"IPCAPS: an R package for iterative pruning to capture population structure.","authors":"Kridsadakorn Chaichoompu, Fentaw Abegaz, Sissades Tongsima, Philip James Shaw, Anavaj Sakuntabhai, Luísa Pereira, Kristel Van Steen","doi":"10.1186/s13029-019-0072-6","DOIUrl":"https://doi.org/10.1186/s13029-019-0072-6","url":null,"abstract":"<p><strong>Background: </strong>Resolving population genetic structure is challenging, especially when dealing with closely related or geographically confined populations. Although Principal Component Analysis (PCA)-based methods and genomic variation with single nucleotide polymorphisms (SNPs) are widely used to describe shared genetic ancestry, improvements can be made especially when fine-scale population structure is the target.</p><p><strong>Results: </strong>This work presents an R package called IPCAPS, which uses SNP information for resolving possibly fine-scale population structure. The IPCAPS routines are built on the iterative pruning Principal Component Analysis (ipPCA) framework that systematically assigns individuals to genetically similar subgroups. In each iteration, our tool is able to detect and eliminate outliers, hereby avoiding severe misclassification errors.</p><p><strong>Conclusions: </strong>IPCAPS supports different measurement scales for variables used to identify substructure. Hence, panels of gene expression and methylation data can be accommodated as well. The tool can also be applied in patient sub-phenotyping contexts. IPCAPS is developed in R and is freely available from http://bio3.giga.ulg.ac.be/ipcaps.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 ","pages":"2"},"PeriodicalIF":0.0,"publicationDate":"2019-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-019-0072-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"37111284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
eUTOPIA: solUTion for Omics data PreprocessIng and Analysis
Pub Date: 2019-01-29. eCollection Date: 2019-01-01. DOI: 10.1186/s13029-019-0071-7
Veer Singh Marwah, Giovanni Scala, Pia Anneli Sofia Kinaret, Angela Serra, Harri Alenius, Vittorio Fortino, Dario Greco
Background: The application of microarrays in omics technologies enables quantification of many biomolecules simultaneously. Microarrays are widely applied to observe, by quantitative comparison, positive or negative effects on biomolecule activity in perturbed versus steady-state conditions. Community resources, such as Bioconductor and CRAN, host tools based on the R language that have become standard for high-throughput analytics. However, applying these tools is technically challenging for generic users and requires specific computational skills. There is a need for an intuitive and easy-to-use platform to process omics data and to visualize and interpret the results.
Results: We propose an integrated software solution, eUTOPIA, that implements a set of essential processing steps as a guided workflow presented to the user as an R Shiny application.
Conclusions: eUTOPIA allows researchers to perform preprocessing and analysis of microarray data via a simple and intuitive graphical interface while using state-of-the-art methods.
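eUTOPIA is an R Shiny application wrapping established preprocessing methods, and none of its code is shown here. Only to make the kind of processing concrete, the sketch below implements quantile normalization, one standard microarray preprocessing step, in plain numpy; it is illustrative and is not taken from eUTOPIA.

```python
# Illustrative only: quantile normalization of a probes-by-samples matrix.
# Not eUTOPIA code; shown to make a typical preprocessing step concrete.
import numpy as np

def quantile_normalize(expr):
    """Return the quantile-normalized matrix in which every sample (column)
    shares the same empirical distribution."""
    order = np.argsort(expr, axis=0)              # per-sample ranking
    ranked = np.sort(expr, axis=0)
    mean_quantiles = ranked.mean(axis=1)          # reference distribution
    out = np.empty_like(expr, dtype=float)
    for j in range(expr.shape[1]):
        out[order[:, j], j] = mean_quantiles      # map ranks back to values
    return out
```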
{"title":"eUTOPIA: solUTion for Omics data PreprocessIng and Analysis.","authors":"Veer Singh Marwah, Giovanni Scala, Pia Anneli Sofia Kinaret, Angela Serra, Harri Alenius, Vittorio Fortino, Dario Greco","doi":"10.1186/s13029-019-0071-7","DOIUrl":"10.1186/s13029-019-0071-7","url":null,"abstract":"<p><strong>Background: </strong>Application of microarrays in omics technologies enables quantification of many biomolecules simultaneously. It is widely applied to observe the positive or negative effect on biomolecule activity in perturbed versus the steady state by quantitative comparison. Community resources, such as Bioconductor and CRAN, host tools based on R language that have become standard for high-throughput analytics. However, application of these tools is technically challenging for generic users and require specific computational skills. There is a need for intuitive and easy-to-use platform to process omics data, visualize, and interpret results.</p><p><strong>Results: </strong>We propose an integrated software solution, eUTOPIA, that implements a set of essential processing steps as a guided workflow presented to the user as an R Shiny application.</p><p><strong>Conclusions: </strong>eUTOPIA allows researchers to perform preprocessing and analysis of microarray data via a simple and intuitive graphical interface while using state of the art methods.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"14 ","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2019-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6352382/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36937294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ProSave: an application for restoring quantitative data to manipulated subsets of protein lists
Pub Date: 2018-11-12. eCollection Date: 2018-01-01. DOI: 10.1186/s13029-018-0070-0
Daniel A Machlab, Gabriel Velez, Alexander G Bassuk, Vinit B Mahajan
Background: In proteomics studies, liquid chromatography tandem mass spectrometry (LC-MS/MS) data are quantified by spectral counts or by some measure of ion abundance. Downstream comparative analysis of protein content (e.g. Venn diagrams and network analysis) typically does not include this quantitative data, and critical information is often lost. To avoid loss of spectral count data in comparative proteomic analyses, it is critical to implement a tool that can rapidly retrieve this information.
Results: We developed ProSave, a free and user-friendly Java-based program that retrieves spectral count data for a curated list of proteins from a large proteomics dataset. ProSave allows for the management of LC-MS/MS datasets and rapidly retrieves spectral count information for a desired list of proteins.
Conclusions: ProSave is open source and freely available at https://github.com/MahajanLab/ProSave. The user manual, implementation notes, and description of methodology and examples are available on the site.
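ProSave itself is a Java program; its interface is not reproduced here. As a rough illustration of the underlying operation (re-attaching spectral counts to a curated subset of proteins), the fragment below filters a full results table by a list of protein IDs; the column names and file name are hypothetical.

```python
# Illustrative only: re-attach spectral counts from a full LC-MS/MS results
# table to a curated subset of protein IDs. Column names ("protein_id",
# "spectral_count") are hypothetical; ProSave itself is a Java application.
import pandas as pd

def restore_counts(full_table_csv, curated_ids):
    counts = pd.read_csv(full_table_csv)                      # full dataset
    subset = counts[counts["protein_id"].isin(curated_ids)]   # curated list only
    return subset[["protein_id", "spectral_count"]]

# Example (hypothetical file and IDs):
# restore_counts("lcmsms_counts.csv", {"P04637", "P38398"})
```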
{"title":"ProSave: an application for restoring quantitative data to manipulated subsets of protein lists.","authors":"Daniel A Machlab, Gabriel Velez, Alexander G Bassuk, Vinit B Mahajan","doi":"10.1186/s13029-018-0070-0","DOIUrl":"https://doi.org/10.1186/s13029-018-0070-0","url":null,"abstract":"<p><strong>Background: </strong>In proteomics studies, liquid chromatography tandem mass spectrometry data (LC-MS/MS) is quantified by spectral counts or by some measure of ion abundance. Downstream comparative analysis of protein content (e.g. Venn diagrams and network analysis) typically does not include this quantitative data and critical information is often lost. To avoid loss of spectral count data in comparative proteomic analyses, it is critical to implement a tool that can rapidly retrieve this information.</p><p><strong>Results: </strong>We developed ProSave, a free and user-friendly Java-based program that retrieves spectral count data from a curated list of proteins in a large proteomics dataset. ProSave allows for the management of LC-MS/MS datasets and rapidly retrieves spectral count information for a desired list of proteins.</p><p><strong>Conclusions: </strong>ProSave is open source and freely available at https://github.com/MahajanLab/ProSave. The user manual, implementation notes, and description of methodology and examples are available on the site.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"13 ","pages":"3"},"PeriodicalIF":0.0,"publicationDate":"2018-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-018-0070-0","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36691465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simulating pedigrees ascertained for multiple disease-affected relatives
Pub Date: 2018-10-15. eCollection Date: 2018-01-01. DOI: 10.1186/s13029-018-0069-6
Christina Nieuwoudt, Samantha J Jones, Angela Brooks-Wilson, Jinko Graham
Background: Studies that ascertain families containing multiple relatives affected by disease can be useful for identification of causal, rare variants from next-generation sequencing data.
Results: We present the R package SimRVPedigree, which allows researchers to simulate pedigrees ascertained on the basis of multiple, affected relatives. By incorporating the ascertainment process in the simulation, SimRVPedigree allows researchers to better understand the within-family patterns of relationship amongst affected individuals and ages of disease onset.
Conclusions: Through simulation, we show that affected members of a family segregating a rare disease variant tend to be more numerous and cluster in relationships more closely than those for sporadic disease. We also show that the family ascertainment process can lead to apparent anticipation in the age of onset. Finally, we use simulation to gain insight into the limit on the proportion of ascertained families segregating a causal variant. SimRVPedigree should be useful to investigators seeking insight into the family-based study design through simulation.
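SimRVPedigree simulates full pedigrees with disease-onset and ascertainment models in R; none of that is reproduced here. The toy Python sketch below only illustrates the ascertainment idea from the conclusions: simulate sibships in which a rare risk variant may or may not segregate, keep the families with at least two affected members, and observe what proportion of the ascertained families actually carry the variant. All rates are invented for illustration.

```python
# Toy sketch of ascertainment on multiple affected relatives; not the
# SimRVPedigree model. All probabilities below are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)

def simulate_family(n_sibs=4, carrier_family_prob=0.01,
                    penetrance=0.4, sporadic_rate=0.05):
    carrier_family = rng.random() < carrier_family_prob
    if carrier_family:
        carrier = rng.random(n_sibs) < 0.5          # each sib inherits w.p. 1/2
        affected = rng.random(n_sibs) < np.where(carrier, penetrance, sporadic_rate)
    else:
        affected = rng.random(n_sibs) < sporadic_rate
    return carrier_family, affected.sum()

families = [simulate_family() for _ in range(200_000)]
ascertained = [(c, k) for c, k in families if k >= 2]      # >= 2 affected sibs
prop_carrier = np.mean([c for c, _ in ascertained])
print(f"Proportion of ascertained families carrying the variant: {prop_carrier:.2f}")
```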
{"title":"Simulating pedigrees ascertained for multiple disease-affected relatives.","authors":"Christina Nieuwoudt, Samantha J Jones, Angela Brooks-Wilson, Jinko Graham","doi":"10.1186/s13029-018-0069-6","DOIUrl":"https://doi.org/10.1186/s13029-018-0069-6","url":null,"abstract":"<p><strong>Background: </strong>Studies that ascertain families containing multiple relatives affected by disease can be useful for identification of causal, rare variants from next-generation sequencing data.</p><p><strong>Results: </strong>We present the R package SimRVPedigree, which allows researchers to simulate pedigrees ascertained on the basis of multiple, affected relatives. By incorporating the ascertainment process in the simulation, SimRVPedigree allows researchers to better understand the within-family patterns of relationship amongst affected individuals and ages of disease onset.</p><p><strong>Conclusions: </strong>Through simulation, we show that affected members of a family segregating a rare disease variant tend to be more numerous and cluster in relationships more closely than those for sporadic disease. We also show that the family ascertainment process can lead to apparent anticipation in the age of onset. Finally, we use simulation to gain insight into the limit on the proportion of ascertained families segregating a causal variant. SimRVPedigree should be useful to investigators seeking insight into the family-based study design through simulation.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"13 ","pages":"2"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-018-0069-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36614519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SOV_refine: A further refined definition of segment overlap score and its significance for protein structure similarity
Pub Date: 2018-04-20. eCollection Date: 2018-01-01. DOI: 10.1186/s13029-018-0068-7
Tong Liu, Zheng Wang
Background: The segment overlap score (SOV) has been used to evaluate predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing them with the native or reference secondary structures, another sequence of H, E, and C. SOV's advantage is that it considers the size of continuous overlapping segments and assigns extra allowance to longer continuous overlapping segments, instead of judging only from the percentage of overlapping individual positions, as the Q3 score does. However, we have found a drawback in its previous definition: it cannot ensure that the assigned allowance increases when more residues in a segment are predicted accurately.
Results: We designed a new way of assigning allowance that keeps all the advantages of the previous SOV score definitions and ensures that the amount of allowance assigned increases when more elements in a segment are predicted accurately. Furthermore, our improved SOV achieves a higher correlation with the quality of protein models measured by the GDT-TS score and TM-score, indicating a better ability to evaluate tertiary structure quality at the secondary structure level. We analyzed the statistical significance of SOV scores and found threshold values for distinguishing two protein structures (SOV_refine > 0.19) and for indicating whether two proteins share the same CATH fold (SOV_refine > 0.94 and > 0.90 for three- and eight-state secondary structures, respectively). We provide two further example applications: using the score as a machine learning feature for protein model quality assessment, and comparing different definitions of topologically associating domains. In both cases, our newly defined SOV score resulted in better performance.
Conclusions: The SOV score can be widely used in bioinformatics research and in other fields that need to compare two sequences of letters in which continuous segments have important meanings. We also generalized the previous SOV definitions so that they work for sequences composed of more than three states (e.g., the eight-state definition of protein secondary structures). A standalone software package has been implemented in Perl, with source code released. The software can be downloaded from http://dna.cs.miami.edu/SOV/.
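For readers unfamiliar with the score being refined, the sketch below implements the classic SOV'99 definition, in which each reference segment's contribution is weighted by (minov + delta)/maxov; it does not include the refined allowance introduced in this paper, and the example strings are arbitrary.

```python
# Sketch of the classic SOV'99 segment overlap score (not SOV_refine).
# Inputs are secondary-structure strings over the states, e.g. "HHHEEECCC".

def segments(ss, state):
    """Yield (start, end) index pairs (inclusive) of runs of `state` in ss."""
    start = None
    for i, c in enumerate(ss):
        if c == state and start is None:
            start = i
        elif c != state and start is not None:
            yield (start, i - 1)
            start = None
    if start is not None:
        yield (start, len(ss) - 1)

def sov99(reference, predicted, states="HEC"):
    total, n = 0.0, 0
    for state in states:
        ref_segs = list(segments(reference, state))
        pred_segs = list(segments(predicted, state))
        for s1 in ref_segs:
            len1 = s1[1] - s1[0] + 1
            overlapping = [s2 for s2 in pred_segs
                           if s2[0] <= s1[1] and s1[0] <= s2[1]]
            if not overlapping:
                n += len1                      # counted in N, adds no score
                continue
            for s2 in overlapping:
                len2 = s2[1] - s2[0] + 1
                minov = min(s1[1], s2[1]) - max(s1[0], s2[0]) + 1
                maxov = max(s1[1], s2[1]) - min(s1[0], s2[0]) + 1
                delta = min(maxov - minov, minov, len1 // 2, len2 // 2)
                total += (minov + delta) / maxov * len1
                n += len1
    return 100.0 * total / n if n else 0.0

# Example: sov99("CCHHHHHCCEEEECC", "CCHHHHCCCEEEECC")
```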
{"title":"SOV_refine: A further refined definition of segment overlap score and its significance for protein structure similarity.","authors":"Tong Liu, Zheng Wang","doi":"10.1186/s13029-018-0068-7","DOIUrl":"https://doi.org/10.1186/s13029-018-0068-7","url":null,"abstract":"<p><strong>Background: </strong>The segment overlap score (SOV) has been used to evaluate the predicted protein secondary structures, a sequence composed of helix (H), strand (E), and coil (C), by comparing it with the native or reference secondary structures, another sequence of H, E, and C. SOV's advantage is that it can consider the size of continuous overlapping segments and assign extra allowance to longer continuous overlapping segments instead of only judging from the percentage of overlapping individual positions as Q3 score does. However, we have found a drawback from its previous definition, that is, it cannot ensure increasing allowance assignment when more residues in a segment are further predicted accurately.</p><p><strong>Results: </strong>A new way of assigning allowance has been designed, which keeps all the advantages of the previous SOV score definitions and ensures that the amount of allowance assigned is incremental when more elements in a segment are predicted accurately. Furthermore, our improved SOV has achieved a higher correlation with the quality of protein models measured by GDT-TS score and TM-score, indicating its better abilities to evaluate tertiary structure quality at the secondary structure level. We analyzed the statistical significance of SOV scores and found the threshold values for distinguishing two protein structures (SOV_refine > 0.19) and indicating whether two proteins are under the same CATH fold (SOV_refine > 0.94 and > 0.90 for three- and eight-state secondary structures respectively). We provided another two example applications, which are when used as a machine learning feature for protein model quality assessment and comparing different definitions of topologically associating domains. We proved that our newly defined SOV score resulted in better performance.</p><p><strong>Conclusions: </strong>The SOV score can be widely used in bioinformatics research and other fields that need to compare two sequences of letters in which continuous segments have important meanings. We also generalized the previous SOV definitions so that it can work for sequences composed of more than three states (e.g., it can work for the eight-state definition of protein secondary structures). A standalone software package has been implemented in Perl with source code released. The software can be downloaded from http://dna.cs.miami.edu/SOV/.</p>","PeriodicalId":35052,"journal":{"name":"Source Code for Biology and Medicine","volume":"13 ","pages":"1"},"PeriodicalIF":0.0,"publicationDate":"2018-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1186/s13029-018-0068-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36058289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}