Journal of Integrative Bioinformatics: Latest Articles (Page 2)

Automated mitosis detection in stained histopathological images using Faster R-CNN and stain techniques.

IF 1.8 Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Journal of Integrative Bioinformatics

Pub Date : 2025-06-11 eCollection Date: 2025-06-01 DOI: 10.1515/jib-2024-0049

Jesús García-Salmerón, José Manuel García, Gregorio Bernabé, Pilar González-Férez

Accurate mitosis detection is essential for cancer diagnosis and treatment. Traditional manual counting by pathologists is time-consuming and error-prone. This research investigates automated mitosis detection in stained histopathological images using Deep Learning (DL) techniques, particularly object detection models. We propose a two-stage object detection model based on Faster R-CNN to detect mitosis in histopathological images. Stain augmentation and normalization techniques are also applied to address the significant challenge of domain shift in histopathological image analysis. The experiments are conducted on the MIDOG++ dataset, the most recent dataset from the MIDOG challenge. This research builds on our previous work, which proposed two one-stage frameworks, based in particular on RetinaNet using fastai and PyTorch. Our results indicate favorable F1-scores across various scenarios and tumor types, demonstrating the effectiveness of the object detection models. In addition, Faster R-CNN with stain techniques provides the most accurate and reliable mitosis detection, while the RetinaNet models run faster. Our results highlight the importance of handling domain shifts and the number of mitotic figures for robust diagnostic tools.
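To make the F1-score mentioned in the abstract concrete, the following is a minimal sketch of how a detection F1 could be computed from predicted and annotated mitosis centers. The greedy matching rule, the function name `detection_f1`, and the 25-pixel radius `max_dist` are illustrative assumptions; the abstract does not describe the matching protocol the authors used.

```python
import math


def detection_f1(predicted, ground_truth, max_dist=25.0):
    """Greedy one-to-one matching of predicted points to annotated points.

    predicted: list of (x, y, score) tuples.
    ground_truth: list of (x, y) tuples.
    max_dist: hypothetical matching radius in pixels (an assumption).
    Returns (tp, fp, fn, f1).
    """
    unmatched = list(ground_truth)
    tp = 0
    # Visit the most confident predictions first.
    for x, y, _score in sorted(predicted, key=lambda p: -p[2]):
        best, best_d = None, max_dist
        for gt in unmatched:
            d = math.hypot(x - gt[0], y - gt[1])
            if d <= best_d:
                best, best_d = gt, d
        if best is not None:
            # Each annotation can be matched at most once.
            unmatched.remove(best)
            tp += 1
    fp = len(predicted) - tp
    fn = len(unmatched)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return tp, fp, fn, f1


preds = [(10, 10, 0.9), (100, 100, 0.8), (300, 300, 0.7)]
truth = [(12, 11), (105, 98), (500, 500)]
print(detection_f1(preds, truth))
```

In this toy input two predictions fall within the radius of an annotation, one prediction is spurious, and one annotation is missed, so precision and recall are both 2/3.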

Predicting DDI-induced pregnancy and neonatal ADRs using sparse PCA and stacking ensemble approach.

Pub Date : 2025-06-10 eCollection Date: 2025-06-01 DOI: 10.1515/jib-2024-0056

Anushka Chaurasia, Deepak Kumar, Yogita

Predicting drug-drug interaction (DDI)-induced adverse drug reactions (ADRs) using computational methods is challenging due to limited data samples, data sparsity, and high dimensionality. Class imbalance further complicates prediction. Different computational techniques have been presented for predicting DDI-induced ADRs in the general population; however, DDI-induced pregnancy and neonatal ADRs remain underexplored. In the present work, a sparse ensemble-based computational approach is proposed that leverages SMILES strings as features, addresses high-dimensional and sparse data using Sparse Principal Component Analysis (SPCA), mitigates class imbalance with the Multilabel Synthetic Minority Oversampling Technique (MLSMOTE), and predicts pregnancy and neonatal ADRs through a stacking ensemble model. SPCA was evaluated for handling sparse data and showed a 2.67%-5.45% improvement over PCA. The proposed stacking ensemble model outperformed six state-of-the-art predictors by 1.16%-14.94% on micro and macro scores for True Positive Rate (TPR), F1 Score, False Positive Rate (FPR), Precision, Hamming Loss, and ROC-AUC Score.
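The difference between micro and macro scores, and what Hamming loss measures, can be illustrated with a small multilabel sketch. The function name `multilabel_scores` and the toy label matrices are assumptions for illustration only; they are not the authors' data or code.

```python
def multilabel_scores(y_true, y_pred):
    """Micro F1, macro F1, and Hamming loss for 0/1 label matrices.

    y_true, y_pred: lists of equal-length 0/1 label vectors, one per sample.
    """
    n_labels = len(y_true[0])
    tp = [0] * n_labels
    fp = [0] * n_labels
    fn = [0] * n_labels
    mismatches = 0
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t == 1 and p == 1:
                tp[j] += 1
            elif t == 0 and p == 1:
                fp[j] += 1
            elif t == 1 and p == 0:
                fn[j] += 1
            if t != p:
                mismatches += 1

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp), sum(fp), sum(fn))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(a, b, c) for a, b, c in zip(tp, fp, fn)) / n_labels
    # Hamming loss: fraction of individual label decisions that are wrong.
    hamming = mismatches / (len(y_true) * n_labels)
    return micro, macro, hamming


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
print(multilabel_scores(y_true, y_pred))
```

Macro averaging weights every label equally, so rare labels, which are common in imbalanced ADR data, influence it as much as frequent ones. Micro averaging is dominated by the frequent labels.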

A ViTUNeT-based model using YOLOv8 for efficient LVNC diagnosis and automatic cleaning of dataset.

Pub Date : 2025-06-04 eCollection Date: 2025-06-01 DOI: 10.1515/jib-2024-0048

Salvador de Haro, Gregorio Bernabé, José Manuel García, Pilar González-Férez

Left ventricular non-compaction (LVNC) is a cardiac condition marked by excessive trabeculae in the left ventricle's inner wall. Although various methods exist to measure these structures, the medical community still lacks consensus on the best approach. Previously, we developed DL-LVTQ, a tool based on a U-Net neural network, to quantify trabeculae in this region. In this study, we expand the dataset to include new patients with Titin cardiomyopathy and healthy individuals with fewer trabeculae, which requires retraining our models to improve predictions. We also propose ViTUNeT, a neural network architecture combining U-Net and Vision Transformers to segment the left ventricle more accurately. Additionally, we train a YOLOv8 model to detect the ventricle and integrate it with the ViTUNeT model to focus on the region of interest. Results from ViTUNeT and YOLOv8 are similar to those of DL-LVTQ, suggesting that dataset quality limits further accuracy improvements. To test this, we analyze MRI images and develop a method using two YOLOv8 models to identify and remove problematic images, leading to better results. Combining YOLOv8 with deep learning networks offers a promising approach for improving cardiac image analysis and segmentation.
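The idea of using a detector to focus a segmentation network on the region of interest can be sketched as a simple crop step. This is only an illustration under stated assumptions: the function `crop_to_box`, the box format `(x_min, y_min, x_max, y_max)`, and the relative `margin` are hypothetical, and the abstract does not say how the authors combine the YOLOv8 output with ViTUNeT.

```python
def crop_to_box(image, box, margin=0.1):
    """Crop a 2D image (list of rows) to a detection box plus a margin.

    box: (x_min, y_min, x_max, y_max) in integer pixels, in the form a
         detector might return (format assumed for illustration).
    margin: hypothetical fraction of the box size added on every side.
    Returns the crop and its (x, y) offset in the original image, so a
    mask predicted on the crop can be pasted back at the right place.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    pad_x = int(round((x1 - x0) * margin))
    pad_y = int(round((y1 - y0) * margin))
    # Clamp the padded box to the image bounds.
    x0 = max(0, x0 - pad_x)
    y0 = max(0, y0 - pad_y)
    x1 = min(width, x1 + pad_x)
    y1 = min(height, y1 + pad_y)
    crop = [row[x0:x1] for row in image[y0:y1]]
    return crop, (x0, y0)


# Toy 10x10 "image" whose pixel value encodes its position.
image = [[r * 10 + c for c in range(10)] for r in range(10)]
crop, offset = crop_to_box(image, (2, 3, 7, 8), margin=0.2)
print(len(crop), len(crop[0]), offset)
```

Returning the offset alongside the crop is a design choice that keeps the segmentation result traceable to original image coordinates.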

Bioinformatic analysis of the regulatory potential of tagging SNPs provides evidence of the involvement of genes encoding the heat-resistant obscure (Hero) proteins in the pathogenesis of cardiovascular diseases.

Pub Date : 2025-06-03 eCollection Date: 2025-03-01 DOI: 10.1515/jib-2024-0043

Vladislav V Shilenok, Irina V Shilenok, Vladislav O Soldatov, Yuriy L Orlov, Ksenia A Kobzeva, Alexey V Deykin, Olga Yu Bushueva

Although multiple aspects of the molecular pathology underlying cardiovascular diseases (CVDs) have been revealed, the complete picture has yet to be elucidated. In this respect, annotation of novel links between genes and atherosclerosis is of great importance for cardiovascular medicine. In line with our previous research, we aimed to analyze the contribution to cardiovascular predisposition of the genes encoding Hero-proteins, polypeptides with chaperone activity. The following bioinformatic sources were used to annotate data on the cardiovascular contribution of Hero-proteins and their genes: SNPinfo Web Server, The Cardiovascular Disease Knowledge Portal, GTEx Portal, HaploReg, rSNPBase, RegulomeDB, atSNP, Gene Ontology, QTLbase, and the Blood eQTL browser. Almost all analyzed genes (except BEX3) were characterized by a very high regulatory potential of tag SNPs. We found multiple substantial impacts of the analyzed SNPs on histone modifications, eQTL effects on CVD-related genes, and binding to transcription factors involved in biological processes that are pathogenetically significant for CVDs. Here we provide in silico evidence of the involvement of genes C9orf16 (BBLN), C11orf58, SERBP1, SERF2, and C19orf53 in CVDs and their risk factors (high blood pressure, dyslipidemia, obesity, arrhythmias, etc.), thus revealing Hero-proteins as putative actors in the pathobiology of the heart and vessels.

Survival risk prediction in hematopoietic stem cell transplantation for multiple myeloma.

Pub Date : 2025-06-03 eCollection Date: 2025-06-01 DOI: 10.1515/jib-2024-0053

Jose María Belmonte, Miguel Blanquer, Gregorio Bernabé, Fernando Jiménez, José Manuel García

This paper investigates the application of Survival Analysis (SA) techniques to forecast outcomes after autologous Hematopoietic Stem Cell Transplantation (aHSCT) for Multiple Myeloma (MM). Using six SA models, we examine their predictive capabilities, measured through the Concordance Index (C-index) metric. Beyond evaluating model performance, we analyze feature importance using permutation and SHAP methods, highlighting key clinical factors such as treatment history, disease stage, and prior disease progression or relapse as critical predictors of survival. The findings suggest that while all models performed well based on the C-index, a detailed examination revealed variations in how each model processed data. Specifically, the Coxnet and Random Survival Forest models exhibited a more thorough use of clinical variables, whereas the gradient boosting models appeared to rely on a narrower range of features, potentially limiting their ability to differentiate between patients with comparable profiles. Risk predictions categorized patients into low, moderate, and high-risk levels. For lower-risk patients, the procedure showed positive outcomes, while higher-risk individuals were predicted to have limited survival benefits, suggesting that alternative treatments should be considered. Lastly, we propose future research to expand these models into time-to-event estimations, offering additional support for decision-making by predicting patient life expectancy post-transplant, considering their pre-transplant clinical attributes.
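Since the models are compared through the C-index, a minimal sketch of Harrell's concordance index for right-censored data may help. The function name and toy data are assumptions for illustration. Pairs with tied observation times are ignored here for simplicity, which a production implementation would need to handle.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored survival data.

    times: observed follow-up times.
    events: 1 if the event was observed, 0 if the patient was censored.
    risk_scores: model outputs where a higher value means higher risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had the event strictly before
            # j's observed time; censored i cannot anchor a comparison.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risk = [0.9, 0.4, 0.1, 0.6]
print(concordance_index(times, events, risk))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. In the toy data three of the four comparable pairs are ordered correctly.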

Data management in balance - a decade of balancing pragmatism, sustainability and innovation at plant research center IPK Gatersleben.

Pub Date : 2025-05-30 eCollection Date: 2025-03-01 DOI: 10.1515/jib-2025-0012

Danuta Schüler, Matthias Lange, Thomas Altmann, Maria Cuacos, Daniel Arend, John Charles D'Auria, Anne Fiebig, Jochen Kumlehn, Kerstin Neumann, Michael Melzer, Elena Rey-Mazón, Hardy Rolletschek, Uwe Scholz, Evelin Willner, Jochen C Reif

The Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben is a leading international plant science institute specializing in biodiversity and crop plant performance research. Over the last decade, all phases of the research data lifecycle were implemented as a continuous process in conjunction with information technology, standardization, and sustainable research data management (RDM) processes. Under the leadership of a team of data stewards, a research data infrastructure, process landscape, capacity building, and governance structures were successfully established. As a result, a generic research data infrastructure was created to serve the principles of good scientific practice, archiving research data in an accessible and sustainable manner, even before the FAIR criteria were formulated. In this paper, we discuss success stories as well as pitfalls and summarize the experiences from 15 years of operating a central RDM infrastructure. We present measures for agile requirements engineering, technical and organizational implementation, governance, training, and roll-out. We show the benefits of a participatory approach across all departments, personnel roles, and researcher profiles through pilot working groups and data management champions. As a result, an ambidextrous approach to data management was implemented, referring to the ability to efficiently combine operational needs, support daily tasks in compliance with the FAIR criteria, while remaining open to adopting technical innovations in an agile manner.

Towards a more accurate and reliable evaluation of machine learning protein-protein interaction prediction model performance in the presence of unavoidable dataset biases.

Pub Date : 2025-04-02 eCollection Date: 2025-06-01 DOI: 10.1515/jib-2024-0054

Alba Nogueira-Rodríguez, Daniel Glez-Peña, Cristina P Vieira, Jorge Vieira, Hugo López-Fernández

The characterization of protein-protein interactions (PPIs) is fundamental to understanding cellular functions. Although machine learning methods for this task have historically reported prediction accuracies of up to 95%, including methods that use only raw protein sequences, it has been highlighted that these figures could be overestimated due to the use of random splits and of metrics that do not take into account potential biases in the datasets. Here, we propose a per-protein utility metric, pp_MCC, able to show a drop in performance in both random and unseen-protein split scenarios. We tested ML models based on sequence embeddings. The pp_MCC metric shows reduced performance even in a random split, reaching levels similar to those shown by the raw MCC metric computed over an unseen-protein split, and drops even further when pp_MCC is used in an unseen-protein split scenario. Thus, the metric gives a more realistic performance estimate while still allowing the use of random splits, which could be of interest for more protein-centric studies. Given the low adjusted performance obtained, there seems to be room for improvement when using only primary sequence information, suggesting the need to include complementary protein data, together with the use of the pp_MCC metric.
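The abstract does not define pp_MCC, so the following is only one plausible reading, labeled as an assumption: compute the Matthews correlation coefficient (MCC) separately over the test pairs that involve each protein, then average across proteins. The function names and toy data are hypothetical, and the published metric may differ in its details.

```python
import math
from collections import defaultdict


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def per_protein_mcc(pairs, y_true, y_pred):
    """Average of per-protein MCC values (an ASSUMED reading of pp_MCC).

    pairs: list of (protein_a, protein_b); y_true, y_pred: 0/1 labels.
    Returns (mean score, dict of per-protein scores).
    """
    slot = {(1, 1): 0, (0, 0): 1, (0, 1): 2, (1, 0): 3}  # tp, tn, fp, fn
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for (a, b), t, p in zip(pairs, y_true, y_pred):
        # A set is used so a self-interaction is counted once.
        for protein in {a, b}:
            counts[protein][slot[(t, p)]] += 1
    scores = {protein: mcc(*c) for protein, c in counts.items()}
    return sum(scores.values()) / len(scores), scores


pairs = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
mean_score, by_protein = per_protein_mcc(pairs, y_true, y_pred)
print(mean_score, mcc(tp=2, tn=2, fp=1, fn=1))
```

In this toy example the per-protein average is lower than the global MCC over the same predictions, because proteins whose pairs are predicted poorly are not masked by proteins that are easy to predict.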

蛋白质-蛋白质相互作用（PPIs）的表征是理解细胞功能的基础。尽管该任务中的机器学习方法历史上报告的预测精度高达95% %，包括那些仅使用原始蛋白质序列的方法，但由于使用随机分割和不考虑数据集中潜在偏差的指标，这可能被高估了。在这里，我们提出了一个每蛋白质效用度量，pp_MCC，能够显示在随机和不可见的蛋白质分裂情况下的性能下降。我们测试了基于序列嵌入的ML模型。即使在随机分裂中，pp_MCC指标也会降低性能，达到与在不可见的蛋白质分裂中计算的原始MCC指标所显示的水平相似，并且在不可见的蛋白质分裂场景中使用pp_MCC时性能会进一步下降。因此，在允许使用随机分割的同时，该度量能够给出更现实的性能估计，这对于更多以蛋白质为中心的研究可能会很有趣。考虑到所获得的低调整性能，当仅使用初级序列信息时似乎有改进的空间，这表明需要包括补充蛋白质数据，同时使用pp_MCC指标。

Citations: 0

Fcodes update: a kinship encoding framework with F-Tree GUI & LLM inference.

IF 1.8 Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Journal of Integrative Bioinformatics

Pub Date: 2025-03-31 eCollection Date: 2025-06-01 DOI: 10.1515/jib-2024-0046

Daniel Pérez-Rodríguez, Roberto C Agís-Balboa, Hugo López-Fernández

Family structures play a crucial role in personal development, social dynamics, and mental health. Traditional systems for encoding genealogical data, such as Ahnentafel and the Register System, offer methods to document lineage but face limitations, particularly in accommodating horizontal relationships or handling changes in family datasets. Modern computational systems like LINKAGE and PED, while powerful for genetic analysis, lack human readability and are challenging to apply in fields where unstructured, narrative data is common, such as sociology or psychiatry. This paper aims to bridge this gap by enhancing Fcodes, a flexible and intuitive algorithm for encoding kinship relationships that is suited for both manual and computational use. Building on our previous work, we present improvements to the Fcodes core algorithm and command-line interface (CLI), as well as the development of F-Tree, a new graphical user interface (GUI) to streamline the encoding process. Additionally, we introduce a method for estimating the coefficient of inbreeding using Fcodes and explore the application of artificial intelligence, namely large language models (LLMs), to automatically infer family relationships from narrative text. These advancements highlight the potential of Fcodes in a wide range of research contexts, from social studies to genetics and mental health research.
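
The abstract does not say how Fcodes estimates the coefficient of inbreeding, so the following is only a generic sketch of the textbook definition: an individual's inbreeding coefficient equals the kinship coefficient of its two parents, computed recursively over a pedigree. The helper names (`make_kinship`, `inbreeding_coefficient`) and the toy pedigree are hypothetical and do not reflect the Fcodes encoding, CLI, or API.

```python
from functools import lru_cache

# Hypothetical toy pedigree: person -> (father, mother); founders map to (None, None).
FULL_SIB_PEDIGREE = {
    "grandfather": (None, None),
    "grandmother": (None, None),
    "father": ("grandfather", "grandmother"),
    "mother": ("grandfather", "grandmother"),  # full sibling of "father"
    "child": ("father", "mother"),
}


def make_kinship(pedigree):
    """Return a memoised kinship function for {person: (father, mother)}."""

    @lru_cache(maxsize=None)
    def depth(person):
        # Founders (and people missing from the table) have depth 0.
        father, mother = pedigree.get(person, (None, None))
        parents = [p for p in (father, mother) if p is not None]
        return 1 + max(depth(p) for p in parents) if parents else 0

    @lru_cache(maxsize=None)
    def kinship(a, b):
        # An unknown parent is treated as unrelated to everyone.
        if a is None or b is None:
            return 0.0
        if a == b:
            father, mother = pedigree.get(a, (None, None))
            return 0.5 * (1.0 + kinship(father, mother))
        # Expand the deeper individual: it cannot be an ancestor of the other.
        if depth(a) < depth(b):
            a, b = b, a
        father, mother = pedigree.get(a, (None, None))
        return 0.5 * (kinship(father, b) + kinship(mother, b))

    return kinship


def inbreeding_coefficient(pedigree, person):
    """Inbreeding coefficient of `person` = kinship of their two parents."""
    father, mother = pedigree.get(person, (None, None))
    return make_kinship(pedigree)(father, mother)


if __name__ == "__main__":
    print(inbreeding_coefficient(FULL_SIB_PEDIGREE, "child"))
```

For the offspring of full siblings this recursion yields 1/4, and for the offspring of first cousins 1/16, which matches the classical values. Memoisation keeps the recursion tractable on larger pedigrees.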

Citations: 0

Predicting precursors of plant specialized metabolites using DeepMol automated machine learning.

IF 1.8 Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Journal of Integrative Bioinformatics

Pub Date: 2025-03-20 eCollection Date: 2025-06-01 DOI: 10.1515/jib-2024-0050

João Capela, João Cheixo, Dick de Ridder, Oscar Dias, Miguel Rocha

Plants produce specialized metabolites, which play critical roles in defending against biotic and abiotic stresses. Due to their chemical diversity and bioactivity, these compounds have significant economic implications, particularly in the pharmaceutical and agrotechnology sectors. Despite their importance, the biosynthetic pathways of these metabolites remain largely unresolved. Automating the prediction of their precursors, derived from primary metabolism, is essential for accelerating pathway discovery. Using DeepMol's automated machine learning engine, we found that regularized linear classifiers offer optimal, accurate, and interpretable models for this task, outperforming state-of-the-art models while providing chemical insights into their predictions. The pipeline and models are available at the repository: https://github.com/jcapels/SMPrecursorPredictor.
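
To illustrate why a regularized linear classifier can be both accurate and interpretable, here is a minimal sketch of L2-regularized logistic regression on a toy table of binary substructure flags. It is not the DeepMol pipeline or the authors' model: the feature names, the six example rows, the "tryptophan-derived" label, and the helpers `fit_l2_logistic` and `predict` are all made up for illustration, and a real setup would use molecular fingerprints and one classifier per precursor.

```python
import math

# Hypothetical binary features per compound (not taken from the paper).
FEATURES = ["has_indole_ring", "has_isoprene_unit", "has_carboxyl_group"]
X = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
]
# Hypothetical label: 1 if the compound is assumed to derive from tryptophan.
y = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l2_logistic(X, y, l2=0.1, lr=0.5, epochs=500):
    """Batch gradient descent on mean log-loss + (l2 / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The penalty shrinks the weights; the bias is left unregularized.
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability exceeds 0.5, else 0."""
    return int(sum(wj * xj for wj, xj in zip(w, x)) + b > 0.0)


if __name__ == "__main__":
    w, b = fit_l2_logistic(X, y)
    # Each weight can be read directly as evidence for or against the label.
    for name, weight in sorted(zip(FEATURES, w), key=lambda p: -abs(p[1])):
        print(f"{name}: {weight:+.2f}")
```

Because the model is linear, every learned weight maps to one input feature, so the ranking printed at the end shows which substructures push a prediction toward or away from the label. The penalty term keeps those weights small and stable when features are correlated.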

Citations: 0

Designing an optimized theta-defensin peptide for HIV therapy using in-silico approaches.

IF 1.8 Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Journal of Integrative Bioinformatics

Pub Date: 2025-03-19 eCollection Date: 2025-03-01 DOI: 10.1515/jib-2023-0053

Zahra Mosalanejad, Seyed Nooreddin Faraji, Mohammad Reza Rahbar, Ahmad Gholami

Glycoprotein 41 (gp41) of human immunodeficiency virus (HIV), located on the virus's external surface, forms six-helix bundles that facilitate viral entry into the host cell. Theta-defensins, which are cyclic peptides, inhibit the formation of these bundles by binding to the C-terminal heptad repeat (CHR) region of gp41. RC101, a synthetic theta-defensin analog, is active against various HIV subtypes. Molecular docking of the CHR and RC101 was performed using the MDockPeP and HawkDock servers. The bond types and the amino acids essential for binding were identified using AlphaFold3, Chimera, RING, and Cytoscape. Mutable amino acids within the peptide were determined using CUPSAT and DUET. Thirty-two new peptides were designed, and their interactions with the CHR of gp41 were analyzed. The physicochemical properties, toxicity, allergenicity, and antigenicity of the peptides were also investigated. Most of the designed peptides exhibited higher binding affinities to the target than RC101; notably, peptides 1 and 4 had the highest binding affinity and a greater percentage of interactions with critical amino acids of the CHR. Peptides A and E displayed the best physicochemical properties among the designed peptides. The designed peptides may represent a new generation of anti-HIV drugs that could reduce the likelihood of drug resistance.

Citations: 0