There is an increased awareness of the importance of data publication, data sharing, and open science to support research, monitoring and control of vector-borne disease (VBD). Here we describe the efforts of the Global Biodiversity Information Facility (GBIF) and the World Health Organization's Special Programme on Research and Training in Diseases of Poverty (TDR) to promote publication of data related to vectors of diseases. In 2020, a GBIF task group of experts was formed to provide advice and support efforts aimed at enhancing the coverage and accessibility of data on vectors of human diseases within GBIF. Various strategies, such as organizing training courses and publishing data papers, were used to increase this content. This editorial introduces the outcome of a second call for data papers, run in partnership by TDR, GBIF and GigaScience Press, in the journal GigaByte. Biodiversity and infectious diseases are linked in complex ways. These links can involve changes from the microorganism level to that of the habitat, and there are many ways in which these factors interact to affect human health. One way to tackle disease control, and possibly elimination, is to provide stakeholders with access to a wide range of data shared under the FAIR principles, making it possible to support early detection, analysis and evaluation, and to promote policy improvement and/or development.
Title: Bridging Biodiversity and Health: The Global Biodiversity Information Facility's initiative on open data on vectors of human diseases. Authors: Paloma Shimabukuro, Quentin Groom, Florence Fouque, Lindsay Campbell, Theeraphap Chareonviriyaphap, Josiane Etang, Sylvie Manguin, Marianne Sinka, Dmitry Schigel, Kate Ingenloff. Journal: GigaByte (Hong Kong, China), 2024, gigabyte117. Pub Date: 2024-04-11. DOI: 10.46471/gigabyte.117. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11027195/pdf/
Pub Date: 2024-04-01; eCollection Date: 2024-01-01; DOI: 10.46471/gigabyte.116
Peter Menzel
With the advancement of long-read sequencing technologies and their increasing use for bacterial genomics, several methods for generating genome assemblies from error-prone long reads have been developed. These are complemented by various tools for assembly polishing using either long reads, short reads, or reference genomes. End users are therefore left with a plethora of possible combinations of programs for obtaining a final trusted assembly. Hence, there is also a need to measure the completeness and accuracy of such assemblies, for which, again, several evaluation methods implemented in various programs are available. In order to automatically run multiple genome assembly and evaluation programs at once, I developed two workflows for the workflow management system Snakemake, which provide end users with an easy-to-run solution for testing various genome assemblies from their sequencing data. Both workflows use the conda packaging system, so there is no need for manual installation of each program.
Availability & implementation: The workflows are available as open source software under the MIT license at github.com/pmenzel/ont-assembly-snake and github.com/pmenzel/score-assemblies.
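To make the "plethora of possible combinations" concrete, the short Python sketch below enumerates assembler and polisher chains. The tool lists, the `+`-joined naming and the function `assembly_variants` are illustrative assumptions for this example; they are not taken from the two workflows, whose actual configuration is documented in their repositories.

```python
from itertools import product

# Illustrative tool lists (assumptions, not the workflows' actual defaults).
ASSEMBLERS = ["flye", "raven"]
POLISHERS = ["racon", "medaka"]


def assembly_variants(assemblers, polishers, max_polish_steps=2):
    """List every assembler followed by 0..max_polish_steps polishing steps."""
    variants = []
    for assembler in assemblers:
        variants.append(assembler)  # unpolished assembly
        for n_steps in range(1, max_polish_steps + 1):
            # product(..., repeat=n) yields every ordered chain of n polishers
            for chain in product(polishers, repeat=n_steps):
                variants.append("+".join((assembler,) + chain))
    return variants


variants = assembly_variants(ASSEMBLERS, POLISHERS)
print(len(variants), "candidate assemblies")
print(variants[:4])
```

Even with two assemblers and two polishers the list grows quickly, which is the practical motivation for automating the runs rather than launching each combination by hand.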
Title: Snakemake workflows for long-read bacterial genome assembly and evaluation. Journal: GigaByte (Hong Kong, China), 2024, gigabyte116. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11000499/pdf/
Pub Date: 2024-03-21; eCollection Date: 2024-01-01; DOI: 10.46471/gigabyte.115
Remy Gatins, Carlos F Arias, Carlos Sánchez, Giacomo Bernardi, Luis F De León
Holacanthus angelfishes are some of the most iconic marine fishes of the Tropical Eastern Pacific (TEP). However, very limited genomic resources currently exist for the genus. In this study we: (i) assembled and annotated the nuclear genome of the King Angelfish (Holacanthus passer), and (ii) examined the demographic history of H. passer in the TEP. We generated 43.8 Gb of ONT reads and 97.3 Gb of Illumina reads, representing 75× and 167× coverage, respectively. The final genome assembly size was 583 Mb with a contig N50 of 5.7 Mb, which captured 97.5% of the complete Actinopterygii Benchmarking Universal Single-Copy Orthologs (BUSCOs). Repetitive elements accounted for 5.09% of the genome, and 33,889 protein-coding genes were predicted, of which 22,984 were functionally annotated. Our demographic analysis suggests that population expansions of H. passer occurred prior to the last glacial maximum (LGM) and were more likely shaped by events associated with the closure of the Isthmus of Panama. This result is surprising, given that most rapid population expansions in both freshwater and marine organisms have been reported to occur globally after the LGM. Overall, this annotated genome assembly provides a novel molecular resource to study the evolution of Holacanthus angelfishes, while facilitating research into local adaptation, speciation, and introgression in marine fishes.
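As a rough consistency check on the reported depths, coverage can be approximated as sequenced bases divided by genome size. The sketch below assumes the 583 Mb assembly size as the denominator (the authors may instead have used a separate genome size estimate), and the function name `approx_coverage` is hypothetical.

```python
def approx_coverage(sequenced_gb: float, genome_size_mb: float) -> float:
    """Approximate fold coverage: total sequenced bases / genome size."""
    sequenced_mb = sequenced_gb * 1000  # 1 Gb = 1000 Mb
    return sequenced_mb / genome_size_mb


GENOME_MB = 583  # assembly size from the abstract, used here as a proxy

ont_cov = approx_coverage(43.8, GENOME_MB)
illumina_cov = approx_coverage(97.3, GENOME_MB)

print(f"ONT: {ont_cov:.1f}x, Illumina: {illumina_cov:.1f}x")
```

Under this assumption both values round to the figures quoted in the abstract (75× and 167×).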
Title: Whole genome assembly and annotation of the King Angelfish (Holacanthus passer) gives insight into the evolution of marine fishes of the Tropical Eastern Pacific. Journal: GigaByte (Hong Kong, China), 2024, gigabyte115. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10973836/pdf/
Pub Date: 2024-03-14; eCollection Date: 2024-01-01; DOI: 10.46471/gigabyte.114
Lipsa Priyadarsinee, Esther Jamir, Selvaraman Nagamani, Hridoy Jyoti Mahanta, Nandan Kumar, Lijo John, Himakshi Sarma, Asheesh Kumar, Anamika Singh Gaur, Rosaleen Sahoo, S Vaikundamani, N Arul Murugan, U Deva Priyakumar, G P S Raghava, Prasad V Bharatam, Ramakrishnan Parthasarathi, V Subramanian, G Madhavi Sastry, G Narahari Sastry
Molecular Property Diagnostic Suite (MPDS) was conceived and developed as an open-source, disease-specific web portal based on Galaxy. MPDSCOVID-19 was developed for COVID-19 as a one-stop solution for drug discovery research. Galaxy platforms enable the creation of customized workflows connecting various modules in the web server. The architecture of MPDSCOVID-19 employs Galaxy v22.04 features, ported to CentOS 7.8 and Python 3.7. After six years, MPDSCOVID-19 provides significant updates and adds several new tools. Tools developed by our group in Perl/Python and open-source tools are collated and integrated into MPDSCOVID-19 using XML scripts. Our MPDS suite aims to facilitate transparent and open innovation. This approach helps foster inclusiveness in the community while promoting free access and participation in software development.
Availability & implementation: The MPDSCOVID-19 portal can be accessed at https://mpds.neist.res.in:8085/.
Title: Molecular Property Diagnostic Suite for COVID-19 (MPDSCOVID-19): an open-source disease-specific drug discovery portal. Journal: GigaByte (Hong Kong, China), 2024, gigabyte114. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10958779/pdf/
Pub Date: 2024-03-07; eCollection Date: 2024-01-01; DOI: 10.46471/gigabyte.113
Sami Hamdan, Shammi More, Leonard Sasse, Vera Komeyer, Kaustubh R Patil, Federico Raimondo
The fast-paced development of machine learning (ML) and its increasing adoption in research challenge researchers without extensive training in ML. In neuroscience, ML can help understand brain-behavior relationships, diagnose diseases and develop biomarkers using data from sources like magnetic resonance imaging and electroencephalography. Primarily, ML builds models to make accurate predictions on unseen data. Researchers evaluate models' performance and generalizability using techniques such as cross-validation (CV). However, choosing a CV scheme and evaluating an ML pipeline is challenging and, if done improperly, can lead to overestimated results and incorrect interpretations. Here, we created julearn, an open-source Python library allowing researchers to design and evaluate complex ML pipelines without encountering common pitfalls. We present the rationale behind julearn's design and its core features, and showcase three examples of previously published research projects. Julearn simplifies access to ML by providing an easy-to-use environment. With its design, unique features, simple interface, and practical documentation, it serves as a useful Python-based library for research projects.
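One common pitfall of the kind described above is data leakage: preprocessing parameters are estimated on the full data set before cross-validation, so information from the test fold leaks into training. The sketch below illustrates the leakage-free alternative in plain Python, fitting a standardization step on the training fold only. It does not use julearn's API; the function names (`k_fold_indices`, `fit_scaler`, `leakage_free_folds`) are hypothetical and the single-feature data are made up for illustration.

```python
from statistics import mean, pstdev


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def fit_scaler(values):
    """Estimate mean and standard deviation (falls back to 1.0 if constant)."""
    return mean(values), (pstdev(values) or 1.0)


def leakage_free_folds(x, k):
    """Standardize each fold with parameters learned from its training part only."""
    for train, test in k_fold_indices(len(x), k):
        mu, sigma = fit_scaler([x[i] for i in train])  # test fold never seen here
        x_train = [(x[i] - mu) / sigma for i in train]
        x_test = [(x[i] - mu) / sigma for i in test]
        yield x_train, x_test, (mu, sigma)


x = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
folds = list(leakage_free_folds(x, k=3))
for x_train, x_test, (mu, sigma) in folds:
    print(f"train mean used for scaling: {mu:.2f}, test fold scaled: {x_test}")
```

The scaling mean differs from fold to fold and from the global mean, which is exactly what keeps the held-out data out of the fitting step.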
Title: Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models. Journal: GigaByte (Hong Kong, China), 2024, gigabyte113. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10940896/pdf/
Pub Date: 2024-03-06; eCollection Date: 2024-01-01; DOI: 10.46471/gigabyte.112
Yutang Chen, Roland Kölliker, Martin Mascher, Dario Copetti, Axel Himmelbach, Nils Stein, Bruno Studer
This work is an update and extension of the previously published article "Ultralong Oxford Nanopore Reads Enable the Development of a Reference-Grade Perennial Ryegrass Genome Assembly" by Frei et al. The published genome assembly of the doubled haploid perennial ryegrass (Lolium perenne L.) genotype Kyuss (Kyuss v1.0) marked a milestone for forage grass research and breeding. However, order and orientation errors may exist in the pseudo-chromosomes of Kyuss, since barley (Hordeum vulgare L.), which diverged 30 million years ago from perennial ryegrass, was used as the reference to scaffold Kyuss. To correct for structural errors possibly present in the published Kyuss assembly, we de novo assembled the genome again and generated 50-fold coverage high-throughput chromosome conformation capture (Hi-C) data to assist pseudo-chromosome construction. The resulting new chromosome-level assembly Kyuss v2.0 showed improved quality with high contiguity (contig N50 = 120 Mb), high completeness (total BUSCO score = 99%), high base-level accuracy (QV = 50), and correct pseudo-chromosome structure (validated by Hi-C contact map). This new assembly will serve as a better reference genome for Lolium spp. and greatly benefit the forage and turf grass research community.
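Two of the quality metrics quoted above have simple definitions that a short sketch can make explicit. N50 is the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total assembly length. QV is assumed here to be a Phred-scaled consensus quality, QV = -10 · log10(P_error), which is the usual convention but is not stated in the abstract. The function names and the toy contig lengths are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def phred_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


toy_contigs = [10, 8, 6, 4, 2]  # made-up lengths, total = 30
print("N50 of toy contigs:", n50(toy_contigs))

error_rate = phred_to_error_rate(50)
print(f"QV 50 -> about {error_rate * 1e6:.0f} expected errors per Mb")
```

Under the Phred assumption, QV = 50 corresponds to roughly one error per 100,000 bases.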
Title: An improved chromosome-level genome assembly of perennial ryegrass (Lolium perenne L.). Journal: GigaByte (Hong Kong, China), 2024, gigabyte112. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10940895/pdf/
Pub Date: 2024-02-21; eCollection Date: 2024-01-01; DOI: 10.46471/gigabyte.107
Filipi Miranda Soares, Luís Ferreira Pires, Maria Carolina Garcia, Lidio Coradin, Natalia Pirani Ghilardi-Lopes, Rubens Rangel Silva, Aline Martins de Carvalho, Anand Gavai, Yamine Bouzembrak, Benildes Coura Moreira Dos Santos Maculan, Sheina Koffler, Uiara Bandineli Montedo, Debora Pignatari Drucker, Raquel Santiago, Maria Clara Peres de Carvalho, Ana Carolina da Silva Lima, Hillary Dandara Elias Gabriel, Stephanie Gabriele Mendonça de França, Karoline Reis de Almeida, Bárbara Junqueira Dos Santos, Antonio Mauro Saraiva
This paper presents two key data sets derived from the Pomar Urbano project. The first data set is a comprehensive catalog of edible fruit-bearing plant species, native or introduced to Brazil. The second data set, sourced from the iNaturalist platform, tracks the distribution and monitoring of these plants within urban landscapes across Brazil. The study includes data from the capitals of all 27 federative units of Brazil, focusing on the ten cities that contributed the most observations as of August 2023. The research emphasizes the significance of citizen science in urban biodiversity monitoring and its potential to contribute to various fields, including food and nutrition, creative industry, study of plant phenology, and machine learning applications. We expect the data sets presented in this paper to serve as resources for further studies in urban foraging, food security, cultural ecosystem services, and environmental sustainability.
Title: Citizen science data on urban forageable plants: a case study in Brazil. Journal: GigaByte (Hong Kong, China), 2024, gigabyte107. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10905257/pdf/
Pub Date: 2024-02-20; eCollection Date: 2024-01-01; DOI: 10.46471/gigabyte.109
Aleksandra Djordjevic, Junhua Li, Shuangsang Fang, Lei Cao, Marija Ivanovic
This paper introduces a new approach to cell clustering using the Variable Neighborhood Search (VNS) metaheuristic. The purpose of this method is to cluster cells based on both gene expression and spatial coordinates. We first formulated this clustering challenge as an Integer Linear Programming minimization problem. We then introduced a novel model based on the VNS technique and demonstrated its efficacy in navigating the complexities of cell clustering. Notably, our method extends beyond conventional cell-type clustering to spatial domain clustering. This adaptability enables our algorithm to form clusters based on information from gene expression matrices and spatial coordinates. Our validation showed the superior performance of our method when compared to existing techniques. Our approach advances current clustering methodologies and can potentially be applied to several fields, from biomedical research to spatial data analysis.
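For readers unfamiliar with VNS, the generic scheme alternates a random perturbation ("shaking") in progressively larger neighborhoods with a local search, and returns to the smallest neighborhood whenever an improvement is found. The sketch below applies that generic scheme to a toy clustering problem. It is not the authors' model: the objective (within-cluster sum of squared distances), the neighborhoods (relabelling a growing number of random points), the helper `combine_features` with its `spatial_weight` parameter, and all other names are assumptions chosen for illustration. It requires at least two clusters.

```python
import random


def combine_features(expression, coords, spatial_weight=1.0):
    """Concatenate expression values with weighted spatial coordinates."""
    return tuple(expression) + tuple(spatial_weight * c for c in coords)


def cost(points, labels, k):
    """Within-cluster sum of squared distances to the cluster centroids."""
    total = 0.0
    for cluster in range(k):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        if not members:
            continue
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total


def local_search(points, labels, k):
    """Relabel single points for as long as doing so lowers the cost."""
    labels = list(labels)
    best = cost(points, labels, k)
    improved = True
    while improved:
        improved = False
        for i in range(len(points)):
            keep = labels[i]
            for cluster in range(k):
                if cluster == keep:
                    continue
                labels[i] = cluster
                new_cost = cost(points, labels, k)
                if new_cost < best - 1e-12:
                    best, keep, improved = new_cost, cluster, True
            labels[i] = keep  # restore the best label found for point i
    return labels, best


def shake(labels, k, strength, rng):
    """Neighborhood of size `strength`: move that many random points elsewhere."""
    shaken = list(labels)
    for i in rng.sample(range(len(labels)), min(strength, len(labels))):
        shaken[i] = rng.choice([c for c in range(k) if c != shaken[i]])
    return shaken


def vns_cluster(points, k, k_max=3, iterations=50, seed=0):
    """Basic VNS loop: shake, local search, then move or widen the neighborhood."""
    rng = random.Random(seed)
    labels, best = local_search(points, [rng.randrange(k) for _ in points], k)
    for _ in range(iterations):
        strength = 1
        while strength <= k_max:
            candidate, cand_cost = local_search(
                points, shake(labels, k, strength, rng), k)
            if cand_cost < best - 1e-12:
                labels, best, strength = candidate, cand_cost, 1
            else:
                strength += 1
    return labels, best


# Toy data: one expression value plus 2D coordinates per cell (made up).
cells = [combine_features([e], xy) for e, xy in [
    (0.0, (0, 0)), (0.1, (0, 1)), (0.2, (1, 0)),
    (5.0, (10, 10)), (5.1, (10, 11)), (5.2, (11, 10)),
]]
labels, best_cost = vns_cluster(cells, k=2)
print(labels, round(best_cost, 3))
```

The `spatial_weight` parameter is one simple way to trade off expression similarity against spatial proximity; how the published method balances the two is not described in the abstract.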
Title: A novel variable neighborhood search approach for cell clustering for spatial transcriptomics. Journal: GigaByte (Hong Kong, China), 2024, gigabyte109. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10910296/pdf/
As genomic sequencing technology continues to advance, it becomes increasingly important to perform joint analyses of multiple transcriptomics datasets. However, batch effects, which arise for example when sequencing data are measured on different platforms or datasets are collected at different times, present challenges for dataset integration. Here, we report the development of BatchEval Pipeline, a workflow for evaluating batch effects in dataset integration. The BatchEval Pipeline generates a comprehensive report, which consists of a series of HTML pages for assessment findings, including a main page, a raw dataset evaluation page, and several built-in methods evaluation pages. The main page shows basic information about the integrated datasets, a comprehensive batch effect score, and the most recommended method for removing batch effect from the current datasets. The remaining pages show evaluation details for the raw dataset, and evaluation results for each built-in batch effect removal method after it has been applied. This comprehensive report enables researchers to accurately identify and remove batch effects, resulting in more reliable and meaningful biological insights from integrated datasets. In summary, the BatchEval Pipeline represents a significant advancement in batch effect evaluation, and is a valuable tool to improve the accuracy and reliability of experimental results.
Availability & implementation: The source code of the BatchEval Pipeline is available at https://github.com/STOmics/BatchEval.
Title: BatchEval Pipeline: batch effect evaluation workflow for multiple datasets joint analysis. Authors: Chao Zhang, Qiang Kang, Mei Li, Hongqing Xie, Shuangsang Fang, Xun Xu. Journal: GigaByte (Hong Kong, China), 2024, gigabyte108. Pub Date: 2024-02-20. DOI: 10.46471/gigabyte.108. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10905258/pdf/
Pub Date: 2024-02-20; eCollection Date: 2024-01-01; DOI: 10.46471/gigabyte.110
Bohan Zhang, Mei Li, Qiang Kang, Zhonghan Deng, Hua Qin, Kui Su, Xiuwen Feng, Lichuan Chen, Huanlin Liu, Shuangsang Fang, Yong Zhang, Yuxiang Li, Susanne Brix, Xun Xu
In spatially resolved transcriptomics, Stereo-seq facilitates the analysis of large tissues at the single-cell level, offering subcellular resolution and centimeter-level field-of-view. Our previous work on StereoCell introduced a one-stop software using cell nuclei staining images and statistical methods to generate high-confidence single-cell spatial gene expression profiles for Stereo-seq data. With advancements allowing the acquisition of cell boundary information, such as cell membrane/wall staining images, we updated our software to a new version, STCellbin. Using cell nuclei staining images, STCellbin aligns cell membrane/wall staining images with spatial gene expression maps. Advanced cell segmentation ensures the detection of accurate cell boundaries, leading to more reliable single-cell spatial gene expression profiles. We verified that STCellbin can be applied to mouse liver (cell membranes) and Arabidopsis seed (cell walls) datasets, outperforming other methods. The improved capability of capturing single-cell gene expression profiles results in a deeper understanding of the contribution of single-cell phenotypes to tissue biology.
Availability & implementation: The source code of STCellbin is available at https://github.com/STOmics/STCellbin.
Title: Generating single-cell gene expression profiles for high-resolution spatial transcriptomics based on cell boundary images.
Journal: GigaByte (Hong Kong, China), 2024, gigabyte110.
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10905256/pdf/