The untranslated regions (UTRs) of genes significantly impact various biological processes, including transcription, posttranscriptional control, mRNA stability, localization, and translation efficiency. In functional regions of genomes, non-B DNA structures such as cruciform, curved, triplex, G-quadruplex, and Z-DNA structures are common and affect cellular physiology. Although the role of these structures in cis-regulatory regions such as promoters is well established in eukaryotic genomes, their prevalence within UTRs across different eukaryotic classes has not been extensively documented. Our study investigated the prevalence of various non-B DNA motifs within the 5' and 3' UTRs of 360 species representing diverse eukaryotic groups, including Arthropoda (Diptera, Hemiptera, and Hymenoptera), Chordata (Artiodactyla, Carnivora, Galliformes, Passeriformes, Primates, Rodentia, Squamata, Testudines), Magnoliophyta (Brassicales), Fabales (Poales), and Nematoda (Rhabditida), on the basis of datasets derived from the UTRdb. We observed that species belonging to taxonomic orders such as Rhabditida, Diptera, Brassicales, and Hemiptera show a prevalence of curved DNA motifs in their UTRs, whereas orders such as Testudines, Galliformes, and Rodentia show a preponderance of G-quadruplexes in both UTRs. The distribution of motifs is conserved across different taxonomic classes, although species-specific variations in motif preferences were also observed. Our results document the prevalence and potential functional implications of non-B DNA motifs in UTRs, offering insights into the evolutionary and biological significance of these structures.
Title: Dissecting non-B DNA structural motifs in untranslated regions of eukaryotic genomes. Authors: Aruna Sesha Chandrika Gummadi, Divya Kumari Muppa, Venakata Rajesh Yella. Journal: Genomics & Informatics 22(1):25. Publication date: 2024-11-27. DOI: 10.1186/s44342-024-00028-x. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11603647/pdf/
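As an illustration of what counting one kind of non-B DNA motif in a UTR can look like, the sketch below scans a sequence for putative G-quadruplexes. It assumes the widely used consensus pattern of four runs of at least three guanines separated by loops of 1 to 7 nucleotides; the abstract does not say which motif definitions or tools the study used, so the pattern, the function names, and the toy sequence are all assumptions made for this example.

```python
import re

# Assumed consensus for a putative G-quadruplex: G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+.
# This is a common textbook definition, not necessarily the one used in the study.
G4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")


def find_g4_motifs(sequence):
    """Return (start, end, motif) for each non-overlapping match on the given strand."""
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]


def g4_density_per_kb(sequence):
    """Number of putative G4 motifs per 1000 nucleotides of the sequence."""
    if not sequence:
        return 0.0
    return 1000.0 * len(find_g4_motifs(sequence)) / len(sequence)


if __name__ == "__main__":
    toy_utr = "TTGGGAGGGAGGGAGGGTT"  # hypothetical sequence, not real UTR data
    print(find_g4_motifs(toy_utr))
    print(round(g4_density_per_kb(toy_utr), 2))
```

The scan covers only the strand it is given; motifs on the complementary strand would appear as C-rich runs and need a second pass over the reverse complement.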
Pub Date: 2024-11-27; DOI: 10.1186/s44342-024-00029-w
Jinkyeong Lee, Jeong-Ih Shin, Woo Young Cho, Kun Taek Park, Yeun-Jun Chung, Seung-Hyun Jung
Vibrio vulnificus, a gram-negative pathogenic bacterium transmitted via undercooked seafood or contaminated seawater, causes septicemia and wound infections. In this study, we analyzed 15 clinical and 11 environmental isolates. In total, 20 sequence types (STs), including eight novel STs, were identified. Antibiotic resistance gene analysis commonly detected the cyclic AMP receptor protein (CRP) in both the clinical and environmental isolates. Interestingly, both clinical and environmental isolates were non-susceptible to third-generation cephalosporins, such as ceftazidime and cefotaxime, complicating the treatment of V. vulnificus infection. The multiple antibiotic resistance (MAR) index ranged from 0.1 to 0.5, with clinical isolates showing a higher mean MAR index than the environmental isolates, indicating their broader spectrum of resistance. Notably, no quantitative (124.3 vs. 126.5) or qualitative (adherence, antiphagocytosis, and chemotaxis/motility) differences in virulence factors were observed between the environmental and clinical strains. The molecular characteristics identified in this study provide insights into the virulence of V. vulnificus strains in South Korea, highlighting the need for continuous surveillance of antibiotic resistance in emerging V. vulnificus strains.
Title: Genomic characteristics of Vibrio vulnificus strains isolated from clinical and environmental sources. Authors: Jinkyeong Lee, Jeong-Ih Shin, Woo Young Cho, Kun Taek Park, Yeun-Jun Chung, Seung-Hyun Jung. Journal: Genomics & Informatics 22(1):26. Publication date: 2024-11-27. DOI: 10.1186/s44342-024-00029-w. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11603906/pdf/
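The MAR index reported in the abstract above is conventionally computed as the number of antibiotics an isolate is resistant to divided by the number of antibiotics tested. The sketch below shows that arithmetic. The panel size of 10 and the per-isolate counts are hypothetical values chosen only so that the results fall in the reported 0.1 to 0.5 range; they are not data from the study.

```python
def mar_index(resistant_count, tested_count):
    """MAR index = antibiotics the isolate resists / antibiotics tested."""
    if tested_count <= 0:
        raise ValueError("tested_count must be positive")
    if not 0 <= resistant_count <= tested_count:
        raise ValueError("resistant_count must be between 0 and tested_count")
    return resistant_count / tested_count


def mean_mar_index(resistant_counts, tested_count):
    """Mean MAR index over a group of isolates tested on the same panel."""
    if not resistant_counts:
        raise ValueError("at least one isolate is required")
    values = [mar_index(count, tested_count) for count in resistant_counts]
    return sum(values) / len(values)


if __name__ == "__main__":
    panel_size = 10                 # hypothetical number of antibiotics tested
    clinical = [3, 5, 4]            # hypothetical resistant counts per isolate
    environmental = [1, 2, 3]       # hypothetical resistant counts per isolate
    print(mean_mar_index(clinical, panel_size))
    print(mean_mar_index(environmental, panel_size))
```

Comparing the two group means in this way mirrors the clinical versus environmental comparison described in the abstract, under the assumption that every isolate was tested on the same panel.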
Pub Date: 2024-11-26; DOI: 10.1186/s44342-024-00027-y
Anna Cho
Neuromuscular diseases (NMDs) are a group of rare disorders characterized by significant genetic and clinical complexity. Advances in genomics have revolutionized both the diagnosis and treatment of NMDs. While fewer than 30 NMDs had known genetic causes before the 1990s, more than 600 have now been identified, largely due to the adoption of next-generation sequencing (NGS) technologies such as whole-exome sequencing (WES) and whole-genome sequencing (WGS). These technologies have enabled more precise and earlier diagnoses, although the genetic complexity of many NMDs continues to pose challenges. Gene therapy has been a transformative breakthrough in the treatment of NMDs. In spinal muscular atrophy (SMA), therapies like nusinersen, onasemnogene abeparvovec, and risdiplam have dramatically improved patient outcomes. Similarly, Duchenne muscular dystrophy (DMD) has seen significant progress, most notably with the FDA approval of delandistrogene moxeparvovec, the first micro-dystrophin gene therapy. Despite these advancements, challenges remain, including the rarity of many NMDs, genetic heterogeneity, and the high costs associated with genomic technologies and therapies. Continued progress in gene therapy, RNA-based therapeutics, and personalized medicine holds promise for further breakthroughs in the management of these debilitating diseases.
Title: Neuromuscular diseases: genomics-driven advances. Authors: Anna Cho. Journal: Genomics & Informatics 22(1):24. Publication date: 2024-11-26. DOI: 10.1186/s44342-024-00027-y. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11600827/pdf/
The Human Phenotype Ontology (HPO) is widely used for annotating clinical text data, and sufficient annotation is crucial for the effective utilization of clinical texts. LLMs are known to extract symptoms and findings successfully but cannot annotate them with the HPO. We hypothesized that one potential cause is the lack of appropriate terms in the HPO. Therefore, during the Biomedical Linked Annotation Hackathon 8 (BLAH8), we attempted the following two tasks in order to grasp the overall picture of the HPO: (1) extract all HPO terms for each of the 23 HPO subclasses (defined as categories) directly under the HPO "Phenotypic abnormality" and (2) search for major attributes in each of the 23 categories. We employed an LLM for these two tasks and, at the same time, found that the LLM did not work well without ingenuity for tasks that lacked sentences and context. A manual search for terms within each category revealed that the HPO contains a mix of terms with four major attributes: (1) Disease Name, (2) Condition, (3) Test Data, and (4) Symptoms and Findings. Manual curation showed that the ratio of symptoms and findings varied from 0 to 93.1% across categories. Clinicians, who are end-users of medical terminology including the HPO, find ontologies difficult to understand. However, a good-quality ontology is important for good-quality data, and clinicians' help is essential to achieving it. It is also important to make the overall picture and limitations of ontologies easy to understand in order to bring out the explanatory power of LLMs and artificial intelligence.
Title: Examining HPO by organ and system to facilitate practical use by clinicians. Authors: Eisuke Dohi, Terue Takatsuki, Yuka Tateisi, Toyofumi Fujiwara, Yasunori Yamamoto. Journal: Genomics & Informatics 22(1):23. Publication date: 2024-11-12. DOI: 10.1186/s44342-024-00024-1. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11559069/pdf/
Pub Date: 2024-11-01; DOI: 10.1186/s44342-024-00020-5
Jin-Dong Kim, Kousaku Okubo
The paper presents Anatomy3DExplorer, a customized ChatGPT designed as a natural language dialogue interface for exploring 3D models of anatomical structures. It illustrates the significant potential of large language models (LLMs) as user-friendly interfaces for database access. Furthermore, it showcases the integration of LLMs and database APIs within the GPTs framework, offering a promising and straightforward approach.
Title: Customizing GPT for natural language dialogue interface in database access. Authors: Jin-Dong Kim, Kousaku Okubo. Journal: Genomics & Informatics 22(1):22. Publication date: 2024-11-01. DOI: 10.1186/s44342-024-00020-5. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11531191/pdf/
Pub Date: 2024-10-31; DOI: 10.1186/s44342-024-00023-2
Ramya Tekumalla, Juan M Banda
Electronic phenotyping involves a detailed analysis of both structured and unstructured data, employing rule-based methods, machine learning, natural language processing, and hybrid approaches. Currently, the development of accurate phenotype definitions demands extensive literature reviews and clinical experts, rendering the process time-consuming and inherently unscalable. Large language models offer a promising avenue for automating phenotype definition extraction but come with significant drawbacks, including reliability issues, the tendency to generate non-factual data ("hallucinations"), misleading results, and potential harm. To address these challenges, our study pursued two key objectives: (1) defining a standard evaluation set to ensure that large language model outputs are both useful and reliable and (2) evaluating various prompting approaches to extract phenotype definitions from large language models, assessing them with our established evaluation task. Our findings are promising, although the outputs still require human evaluation and validation for this task. Nevertheless, enhanced phenotype extraction is possible, reducing the amount of time spent on literature review and evaluation.
Title: Towards automated phenotype definition extraction using large language models. Authors: Ramya Tekumalla, Juan M Banda. Journal: Genomics & Informatics 22(1):21. Publication date: 2024-10-31. DOI: 10.1186/s44342-024-00023-2. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529293/pdf/
Pub Date: 2024-10-31; DOI: 10.1186/s44342-024-00022-3
Xinzhi Yao, Zhihan He, Jingbo Xia
The extraction of biological regulation events has been a key focus in the field of biomedical natural language processing (BioNLP). However, existing methods often encounter challenges such as cascading errors in text mining pipelines and limitations in topic coverage from the selected corpus. The emergence of large language models (LLMs) presents a potential solution due to their robust semantic understanding and extensive knowledge base. To explore this potential, our project at the Biomedical Linked Annotation Hackathon 8 (BLAH 8) investigated the feasibility of using LLMs to extract biological regulation events. Our findings, based on the analysis of rice literature, demonstrate the promising performance of LLMs in this task, while also highlighting several concerns that must be addressed in future LLM-based applications in low-resource topics.
Title: Bioregulatory event extraction using large language models: a case study of rice literature. Authors: Xinzhi Yao, Zhihan He, Jingbo Xia. Journal: Genomics & Informatics 22(1):20. Publication date: 2024-10-31. DOI: 10.1186/s44342-024-00022-3. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529424/pdf/
The rapidly increasing amount of short-read data generated by NGSs (new-generation sequencers) calls for the development of fast and accurate read alignment programs. Programs based on hash tables (BLAST) and on the Burrows-Wheeler transform (bwa-mem) are in use, and the latter is known to give superior performance. Here we present a new algorithm, a hybrid of hash table and suffix tree, which we designed to speed up the alignment of short reads against large reference sequences such as the human genome. The total turnaround time for processing one human genome sample (read depth of 30) is just 31 min with our system, whereas it was more than 25 h with bwa-mem/gatk. The aligner alone takes 28 min with our system but around 2 h with bwa-mem. Our new algorithm is 4.4 times faster than bwa-mem while achieving similar accuracy. Variant calling and other downstream analyses after the alignment can be done with open-source tools such as SAMtools and Genome Analysis Toolkit (gatk) packages, as well as with our own fast variant caller, which is well parallelized and much faster than gatk.
Title: Fast and accurate short-read alignment with hybrid hash-tree data structure. Authors: Junichiro Makino, Toshikazu Ebisuzaki, Ryutaro Himeno, Yoshihide Hayashizaki. Journal: Genomics & Informatics 22(1):19. Publication date: 2024-10-29. DOI: 10.1186/s44342-024-00012-5. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11520436/pdf/
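To make the hash-table side of the alignment abstract above concrete, here is a minimal sketch of generic seed-and-verify read mapping with a k-mer hash index. It is not the authors' hybrid hash-tree algorithm, whose details the abstract does not give; the function names, the choice of seeding with only the first k-mer of the read, the mismatch-only verification (no indels), and the toy sequences are all assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k, max_mismatches=2):
    """Seed with the read's first k-mer, then verify each candidate by counting mismatches.

    Returns a list of (position, mismatches) for candidates within the mismatch budget.
    """
    hits = []
    if len(read) < k:
        return hits
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # the read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTA"  # toy reference, not real genome data
    index = build_kmer_index(reference, k=4)
    print(align_read("TAGCCGAT", reference, index, k=4))
    print(align_read("TAGCCGTT", reference, index, k=4))
```

A lookup in the index costs roughly constant time per seed, which is why hash-based seeding is attractive; a production aligner would also handle seeds from several read positions, indels, both strands, and memory-efficient index layouts.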
Pub Date: 2024-10-10; DOI: 10.1186/s44342-024-00021-4
Nishad Thalhath
This report presents the findings of a project from the 8th Biomedical Linked Annotation Hackathon (BLAH) to explore lightweight technology stacks for enhancing assistive linked annotations. Using modern JavaScript frameworks and edge functions, we implemented in-browser Named Entity Recognition (NER), serverless embedding and vector search within web interfaces, and efficient serverless full-text search. This experimental approach yielded a proof of concept demonstrating the feasibility and performance of these technologies. The results show that lightweight stacks can significantly improve the efficiency and cost-effectiveness of annotation tools and provide a local-first, privacy-oriented, and secure alternative to traditional server-based solutions in various use cases. This work emphasizes the potential of developing annotation interfaces that are more responsive, scalable, and user-friendly, which would benefit bioinformatics researchers, practitioners, and software developers.
Title: Lightweight technology stacks for assistive linked annotations. Authors: Nishad Thalhath. Journal: Genomics & Informatics 22(1):17. Publication date: 2024-10-10. DOI: 10.1186/s44342-024-00021-4. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11468380/pdf/
Pub Date: 2024-10-10; DOI: 10.1186/s44342-024-00025-0
Jin Sook Lee
Advancements in sequencing technology have significantly enhanced diagnostic capabilities for rare neurological diseases. This progress in molecular diagnostics can greatly impact clinical management and facilitate the development of personalized treatments for patients with rare neurological diseases. Neurologists with expertise should raise clinical awareness, as phenotyping remains crucial for making a clinical diagnosis, even in the genomics era. They should prioritize different types of genomic tests, considering both the benefits and the limitations inherent to each test. Notably, long-read sequencing is being utilized in cases suspected to involve repeat expansion disorders or complex structural variants. Repeat expansion disorders are highly prevalent in neurological diseases, particularly within the ataxia group. Significant efforts, including periodic reanalysis, data sharing, or integration of genomics with multi-omics studies, should be directed toward cases that remain undiagnosed after standard next-generation sequencing.
Title: Molecular diagnostic approach to rare neurological diseases from a clinician viewpoint. Authors: Jin Sook Lee. Journal: Genomics & Informatics 22(1):18. Publication date: 2024-10-10. DOI: 10.1186/s44342-024-00025-0. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11468364/pdf/