Acquiring meaningful representations of gene expression is essential for the accurate prediction of downstream regulatory tasks, such as identifying promoters and transcription factor binding sites. However, the current dependency on supervised learning, constrained by the limited availability of labeled genomic data, impedes the ability to develop robust predictive models with broad generalization capabilities. In response, recent advancements have pivoted towards the application of self-supervised training for DNA sequence modeling, enabling the adaptation of pre-trained genomic representations to a variety of downstream tasks. Departing from the straightforward application of masked language learning techniques to DNA sequences, approaches such as MoDNA enrich genome language modeling with prior biological knowledge. In this study, we advance DNA language models by utilizing the Motif-oriented DNA (MoDNA) pre-training framework, which is established for self-supervised learning at the pre-training stage and is flexible enough for application across different downstream tasks. MoDNA distinguishes itself by efficiently learning semantic-level genomic representations from an extensive corpus of unlabeled genome data, offering a significant improvement in computational efficiency over previous approaches. The framework is pre-trained on a comprehensive human genome dataset and fine-tuned for targeted downstream tasks. Our enhanced analysis and evaluation in promoter prediction and transcription factor binding site prediction have further validated MoDNA’s exceptional capabilities, emphasizing its contribution to advancements in genomic predictive modeling.
Title: Advancing DNA Language Models through Motif-Oriented Pre-Training with MoDNA. Authors: Weizhi An, Yuzhi Guo, Yatao Bian, Hehuan Ma, Jinyu Yang, Chunyuan Li, Junzhou Huang. Journal: BioMedInformatics. Publication date: 2024-06-12. DOI: https://doi.org/10.3390/biomedinformatics4020085
Pub Date: 2024-06-12; DOI: 10.3390/biomedinformatics4020086
E. K. Oladipo, Stephen Feranmi Adeyemo, M. Akinboade, Temitope Michael Akinleye, Kehinde Favour Siyanbola, Precious Ayomide Adeogun, Victor Michael Ogunfidodo, Christiana Adewumi Adekunle, Olubunmi Ayobami Elutade, Esther Eghogho Omoathebu, Blessing Oluwatunmise Taiwo, Elizabeth Olawumi Akindiya, Lucy Ochola, H. Onyeaka
Background: Influenza D Virus (IDV) presents a possible threat to animal and human health, necessitating the development of effective vaccines. Although no human illness linked to IDV has been reported, the possibility of human susceptibility to infection remains uncertain. Hence, there is a need for an animal vaccine to be designed. Such a vaccine will contribute to preventing and controlling IDV outbreaks and developing effective countermeasures against this emerging pathogen. This study, therefore, aimed to design an mRNA vaccine construct against IDV using immunoinformatic methods and evaluate its potential efficacy. Methods: A comprehensive methodology involving epitope prediction, vaccine construction, and structural analysis was employed. Viral sequences from six continents were collected and analyzed. A total of 88 Hemagglutinin Esterase Fusion (HEF) sequences from IDV isolates were obtained, of which 76 were identified as antigenic. Different bioinformatics tools were used to identify preferred CTL, HTL, and B-cell epitopes. The epitopes underwent thorough analysis, and those that can induce a lasting immunological response were selected for the construction. Results: The vaccine prototype comprised nine epitopes, an adjuvant, an MHC I-targeting domain (MITD), a Kozak sequence, 3′ UTR, 5′ UTR, and specific linkers. The mRNA vaccine construct exhibited antigenicity, non-toxicity, and non-allergenicity, with favourable physicochemical properties. The secondary and tertiary structure analyses revealed a stable and accurate vaccine construct. Molecular docking simulations also demonstrated strong binding affinity with toll-like receptors. Conclusions: The study provides a promising framework for developing an effective mRNA vaccine against IDV, highlighting its potential for mitigating the global impact of this viral infection. Further experimental studies are needed to confirm the vaccine’s efficacy and safety.
Title: Utilizing Immunoinformatics for mRNA Vaccine Design against Influenza D Virus. Journal: BioMedInformatics. Publication date: 2024-06-12. DOI: https://doi.org/10.3390/biomedinformatics4020086
Pub Date: 2024-06-09; DOI: 10.3390/biomedinformatics4020083
Anna Sabatini, Costanza Cenerini, Luca Vollero, D. Pau
Background: Continuous glucose monitoring (CGM) systems offer the advantage of noninvasive monitoring and continuous data on glucose fluctuations. This study introduces a new dataset generation model that integrates physiological variables and sensor attributes to produce synthetic but realistic databases, which in turn enables the design of improved CGM systems. Methods: The presented approach uses a combination of physiological data and sensor characteristics to construct a model that considers the impact of these variables on the accuracy of CGM measures. A dataset of 500 sensor responses over a 15-day period is generated and analyzed using machine learning algorithms (random forest regressor and support vector regressor). Results: The random forest and support vector regression models achieved Mean Absolute Errors (MAEs) of 16.13 mg/dL and 16.22 mg/dL, respectively. In contrast, models trained solely on single sensor outputs recorded an average MAE of 11.01±5.12 mg/dL. These findings demonstrate the variable impact of integrating multiple data sources on the predictive accuracy of CGM systems, as well as the complexity of the dataset. Conclusions: This approach provides a foundation for developing more precise algorithms and introduces an initial application on Tiny Machine Control Units (MCUs). More research is recommended to refine these models and validate their effectiveness in clinical settings.
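For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Mean Absolute Error in mg/dL can be computed. The function name and the toy glucose values are hypothetical and are not taken from the study; the authors' actual evaluation code is not described in the abstract.

```python
def mean_absolute_error(reference, predicted):
    """Average absolute difference between reference and predicted values.

    Both inputs are equal-length sequences of glucose readings in mg/dL.
    """
    if len(reference) != len(predicted) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(r - p) for r, p in zip(reference, predicted))
    return total / len(reference)


# Hypothetical reference readings and model predictions (mg/dL).
reference_glucose = [100.0, 150.0, 200.0]
predicted_glucose = [110.0, 140.0, 215.0]
mae = mean_absolute_error(reference_glucose, predicted_glucose)
print(f"MAE = {mae:.2f} mg/dL")
```

Because the MAE keeps the unit of the measurement, the reported values can be read directly as an average error in mg/dL.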
Title: Calibrating Glucose Sensors at the Edge: A Stress Generation Model for Tiny ML Drift Compensation. Journal: BioMedInformatics. Publication date: 2024-06-09. DOI: https://doi.org/10.3390/biomedinformatics4020083
Pub Date: 2024-06-09; DOI: 10.3390/biomedinformatics4020084
Y. Matsuzaka, R. Yashiro
In vaccine development, many developers use the spike protein (S protein), which forms multiple “spike-like” structures protruding from the spherical coronavirus particle, as an antigen. However, there are concerns about its effectiveness and toxicity. When S protein is used in a vaccine, its ability to attack viruses may be weak, and its effectiveness in eliciting immunity will last only a short period of time. Moreover, it may cause “antibody-dependent immune enhancement”, which can enhance infections. The three-dimensional (3D) structure of epitopes is also essential for functional analysis and structure-based vaccine design. During viral infection, large amounts of extracellular vesicles (EVs) are secreted from infected cells; these EVs function as a communication network between cells and coordinate the response to infection. Under conditions where SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) molecular vaccination produces overwhelming SARS-CoV-2 spike glycoprotein, a significant proportion of the overproduced intracellular spike glycoprotein is transported via EVs. Therefore, it will be important to understand the infection mechanisms of SARS-CoV-2 via EV-dependent and EV-independent uptake into cells and to model the infection processes based on 3D structural features at interaction sites.
Title: Understanding the Molecular Actions of Spike Glycoprotein in SARS-CoV-2 and Issues of a Novel Therapeutic Strategy for the COVID-19 Vaccine. Journal: BioMedInformatics. Publication date: 2024-06-09. DOI: https://doi.org/10.3390/biomedinformatics4020084
Pub Date: 2024-06-07; DOI: 10.3390/biomedinformatics4020081
J. Carreras, R. Hamoudi
Background: Diffuse large B-cell lymphoma (DLBCL) is one of the most frequent lymphomas. DLBCL is phenotypically, genetically, and clinically heterogeneous. Aim: We aim to identify new prognostic markers. Methods: We performed anomaly detection analysis, other artificial intelligence techniques, and conventional statistics using gene expression data of 414 patients from the Lymphoma/Leukemia Molecular Profiling Project (GSE10846), and immunohistochemistry in 10 reactive tonsils and 30 DLBCL cases. Results: First, an unsupervised anomaly detection analysis pinpointed outliers (anomalies) in the series, and 12 genes were identified: DPM2, TRAPPC1, HYAL2, TRIM35, NUDT18, TMEM219, CHCHD10, IGFBP7, LAMTOR2, ZNF688, UBL7, and RELB, which belonged to the apoptosis, MAPK, MTOR, and NF-kB pathways. Second, these 12 genes were used to predict overall survival using machine learning, artificial neural networks, and conventional statistics. In a multivariate Cox regression analysis, high expressions of HYAL2 and UBL7 were correlated with poor overall survival, whereas TRAPPC1, IGFBP7, and RELB were correlated with good overall survival (p < 0.01). As a single marker and only in RCHOP-like treated cases, the prognostic value of RELB was confirmed using GSEA analysis and Kaplan–Meier with log-rank test and validated in the TCGA and GSE57611 datasets. Anomaly detection analysis was successfully tested in the GSE31312 and GSE117556 datasets. Using immunohistochemistry, RELB was positive in B-lymphocytes and macrophage/dendritic-like cells, and correlation with HLA DP-DR, SIRPA, CD85A (LILRB3), PD-L1, MARCO, and TOX was explored. Conclusions: Anomaly detection and other bioinformatic techniques successfully predicted the prognosis of DLBCL, and high RELB was associated with a favorable prognosis.
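As an illustration of the Kaplan–Meier survival analysis mentioned above, the following is a minimal sketch of the product-limit estimator for right-censored data. The function name and the toy follow-up data are hypothetical; this is not the authors' pipeline, which the abstract does not describe.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate of the survival function.

    times  : follow-up time for each patient
    events : 1 if the event (death) was observed, 0 if censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    event_times = sorted({t for t, e in zip(times, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        # Events observed exactly at time t.
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times (years) and event indicators.
toy_times = [1, 2, 2, 3, 4]
toy_events = [1, 1, 0, 1, 0]
for time_point, probability in kaplan_meier(toy_times, toy_events):
    print(time_point, round(probability, 3))
```

In a study such as this one, two such curves (for example, high versus low RELB expression) would be compared with a log-rank test; that comparison is omitted here to keep the sketch short.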
Title: Anomaly Detection and Artificial Intelligence Identified the Pathogenic Role of Apoptosis and RELB Proto-Oncogene, NF-kB Subunit in Diffuse Large B-Cell Lymphoma. Journal: BioMedInformatics. Publication date: 2024-06-07. DOI: https://doi.org/10.3390/biomedinformatics4020081
Pub Date: 2024-06-07; DOI: 10.3390/biomedinformatics4020082
Bernardo Gonçalves, Mariana Silva, Luísa Vieira, Pedro Vieira
Current computer vision models require a significant amount of annotated data to improve their performance in a particular task. However, obtaining the required annotated data is challenging, especially in medicine. Hence, data augmentation techniques play a crucial role. In recent years, generative models have been used to create artificial medical images, which have shown promising results. This study aimed to use a state-of-the-art generative model, StyleGAN3, to generate realistic synthetic abdominal magnetic resonance images. These images were evaluated using quantitative metrics and qualitative assessments by medical professionals. For this purpose, an abdominal MRI dataset acquired at Garcia da Horta Hospital in Almada, Portugal, was used. A subset containing only axial gadolinium-enhanced slices was used to train the model. The obtained Fréchet inception distance value (12.89) aligned with the state of the art, and a medical expert confirmed the significant realism and quality of the images. However, specific issues were identified in the generated images, such as texture variations, visual artefacts and anatomical inconsistencies. Despite these issues, this work demonstrated that StyleGAN3 is a viable solution to synthesise realistic medical imaging data, particularly in abdominal imaging.
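For context on the reported metric, the Fréchet inception distance is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The symbols below are assumptions introduced here for illustration (they do not appear in the abstract): $\mu_r, \Sigma_r$ are the feature mean and covariance of the real images, and $\mu_g, \Sigma_g$ those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one; a value of 0 would mean the two fitted Gaussians coincide.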
Title: Abdominal MRI Unconditional Synthesis with Medical Assessment. Journal: BioMedInformatics. Publication date: 2024-06-07. DOI: https://doi.org/10.3390/biomedinformatics4020082
Pub Date: 2024-06-06; DOI: 10.3390/biomedinformatics4020080
Alae Eddine El Hmimdi, Zoï Kapoula
In this study, the challenges posed by limited annotated medical data in the field of eye movement AI analysis are addressed through the introduction of a novel physiologically based gaze data augmentation library. Unlike traditional augmentation methods, which may introduce artifacts and alter pathological features in medical datasets, the proposed library emulates natural head movements during gaze data collection. This approach enhances sample diversity without compromising authenticity. The library evaluation was conducted on both CNN and hybrid architectures using distinct datasets, demonstrating its effectiveness in regularizing the training process and improving generalization. Notably, models trained using the proposed augmentation (EMULATE) with the three HTCE variants achieved a macro F1 score of up to 79%. This pioneering approach leverages domain-specific knowledge to contribute to the robustness and authenticity of deep learning models in the medical domain.
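The macro F1 score cited above averages the per-class F1 scores with equal weight, so rare classes count as much as common ones. The following is a minimal sketch under that standard definition; the function name and the toy labels are hypothetical and unrelated to the study's datasets.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Classes are taken from the union of true and predicted labels.
    A class with no true or predicted samples contributes an F1 of 0.
    """
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Hypothetical three-class example.
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]
print(round(macro_f1(true_labels, predicted_labels), 4))
```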
Title: Physiological Data Augmentation for Eye Movement Gaze in Deep Learning. Journal: BioMedInformatics. Publication date: 2024-06-06. DOI: https://doi.org/10.3390/biomedinformatics4020080
Pub Date: 2024-06-06; DOI: 10.3390/biomedinformatics4020079
Zamara Mariam, Sarfaraz K. Niazi, Matthias Magoola
This article delves into the intersection of generative AI and digital twins within drug discovery, exploring their synergistic potential to revolutionize pharmaceutical research and development. Through various instances and examples, we illuminate how generative AI algorithms, capable of simulating vast chemical spaces and predicting molecular properties, are increasingly integrated with digital twins of biological systems to expedite drug discovery. By harnessing the power of computational models and machine learning, researchers can design novel compounds tailored to specific targets, optimize drug candidates, and simulate their behavior within virtual biological environments. This paradigm shift offers unprecedented opportunities for accelerating drug development, reducing costs, and, ultimately, improving patient outcomes. As we navigate this rapidly evolving landscape, collaboration between interdisciplinary teams and continued innovation will be paramount in realizing the promise of generative AI and digital twins in advancing drug discovery.
Title: Unlocking the Future of Drug Development: Generative AI, Digital Twins, and Beyond. Journal: BioMedInformatics. Publication date: 2024-06-06. DOI: https://doi.org/10.3390/biomedinformatics4020079
Pub Date: 2024-06-03 DOI: 10.3390/biomedinformatics4020078
Peter J. Hunt, Mohammad N. Noori, S. Hazelwood, N. Noori, Wael A. Altabey
Total knee arthroplasty (TKA) is one of the most commonly performed orthopedic surgeries, with nearly one million performed in 2020 in the United States alone. Changing patient demographics, predominantly increases in younger, more active, and more obese patients undergoing TKA, pose a challenge to orthopedic surgeons because these factors present a greater risk of long-term complications. Historically, cemented TKA has been the gold standard for fixation, but long-term aseptic loosening continues to be a risk for cemented implants. Cementless TKA, which relies on the surface morphology of a porous coating for biologic fixation of implant to bone, may provide improved long-term survivorship compared with cement. The quality of this bond depends on an interference fit and on the roughness, or coefficient of friction, between the implant and the bone. Stress shielding is a measure of the difference in the stress experienced by implanted bone versus surrounding native bone. A finite element model (FEM) can be used to quantify and better understand stress shielding in order to evaluate and optimize implant design. In this study, a FEM was constructed to investigate how the surface coating of cementless implants (coefficient of friction) and the location of the coating application affected the stress-shielding response in the tibia. The stress distribution in the native tibia surrounding a cementless TKA implant was found to depend on the coefficient of friction applied at the tip of the implant’s stem. Materials with lower friction coefficients applied to the stem tip resulted in higher compressive stress experienced by implanted bone and more favorable overall stress-shielding responses.
Title: A Study on the Effects of Cementless Total Knee Arthroplasty Implants’ Surface Morphology via Finite Element Analysis. Journal: BioMedInformatics. Pub Date: 2024-06-03. DOI: https://doi.org/10.3390/biomedinformatics4020078
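The abstract above describes stress shielding only as the difference in stress between implanted and native bone. As a rough illustration, the sketch below assumes one common way to express that difference: the fractional reduction in stress relative to native bone. The function name `stress_shielding_ratio` and all stress values are hypothetical and are not taken from the study.

```python
def stress_shielding_ratio(native_stress_mpa, implanted_stress_mpa):
    """Fractional reduction in stress relative to native bone.

    Assumed definition (not confirmed by the study):
        (native - implanted) / native
    0 means no shielding; values near 1 mean the bone is almost unloaded.
    Inputs are compressive stress magnitudes in MPa.
    """
    if native_stress_mpa <= 0:
        raise ValueError("native stress magnitude must be positive")
    return (native_stress_mpa - implanted_stress_mpa) / native_stress_mpa


# Hypothetical stress magnitudes in MPa, chosen only to illustrate the
# qualitative trend reported in the abstract.
native = 10.0
implanted_high_friction_tip = 6.0
implanted_low_friction_tip = 7.5

ratio_high = stress_shielding_ratio(native, implanted_high_friction_tip)
ratio_low = stress_shielding_ratio(native, implanted_low_friction_tip)

print(f"high-friction stem tip: {ratio_high:.2f}")
print(f"low-friction stem tip:  {ratio_low:.2f}")
```

Under this assumed definition, higher compressive stress in the implanted bone gives a smaller ratio, which is consistent with the abstract's statement that lower friction at the stem tip produced a more favorable stress-shielding response.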
Pub Date: 2024-05-22 DOI: 10.3390/biomedinformatics4020077
Soukaina Amniouel, Keertana Yalamanchili, Sreenidhi Sankararaman, M. S. Jafri
Background: Ovarian cancer (OC) is the most lethal gynecological cancer in the United States. Among the different types of OC, serous ovarian cancer (SOC) is the most prevalent. Transcriptomics techniques generate extensive gene expression data, yet only a few of these genes are relevant to clinical diagnosis. Methods: Feature selection (FS) methods address the challenge of high dimensionality in large datasets. This study proposes a computational framework that applies FS techniques to identify genes highly associated with platinum-based chemotherapy response in SOC patients. Using SOC datasets from the Gene Expression Omnibus (GEO) database, the LASSO and varSelRF FS methods were employed. Machine learning classification algorithms such as random forest (RF) and support vector machine (SVM) were also used to evaluate the performance of the models. Results: The proposed framework identified biomarker panels of 9 and 10 genes that are highly correlated with platinum–paclitaxel and platinum-only response in SOC patients, respectively. Predictive models trained on the identified gene signatures achieved accuracy above 90%. Conclusions: In this study, we propose that applying multiple feature selection methods not only effectively reduces the number of identified biomarkers, enhancing their biological relevance, but also corroborates the efficacy of drug response prediction models in cancer treatment.
Title: Evaluating Ovarian Cancer Chemotherapy Response Using Gene Expression Data and Machine Learning. Journal: BioMedInformatics. Pub Date: 2024-05-22. DOI: https://doi.org/10.3390/biomedinformatics4020077
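To make the LASSO feature-selection step named in the abstract above more concrete, here is a minimal sketch of L1-penalized regression solved by coordinate descent. It is not the authors' pipeline: the data are a tiny synthetic matrix, the gene names are placeholders, the response is coded as +1/-1 with no intercept for brevity, and `alpha` is an arbitrary penalty. The varSelRF step and the RF/SVM evaluation are not shown.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/2n) * ||y - Xw||^2 + alpha * ||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(coef[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            rho /= n
            norm /= n
            coef[j] = soft_threshold(rho, alpha) / norm if norm > 0 else 0.0
    return coef


# Synthetic, standardized "expression" values for three placeholder genes
# across eight samples; these are not data from the study.
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
X = [
    [1, 1, 1], [1, 1, -1], [1, -1, 1], [1, -1, -1],
    [-1, 1, 1], [-1, 1, -1], [-1, -1, 1], [-1, -1, -1],
]
# Hypothetical treatment response coded as +1 (responder) / -1 (non-responder).
y = [1, 1, 1, 1, -1, -1, -1, 1]

coef = lasso_coordinate_descent(X, y, alpha=0.3)
selected = [name for name, w in zip(gene_names, coef) if w != 0.0]
print(selected)
```

The L1 penalty sets the coefficients of weakly associated features exactly to zero, which is why LASSO can serve as a feature selector: only genes with a nonzero coefficient are kept for the downstream classifier.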