Pub Date: 2025-06-11; eCollection Date: 2025-06-01; DOI: 10.1515/jib-2024-0049
Jesús García-Salmerón, José Manuel García, Gregorio Bernabé, Pilar González-Férez
Accurate mitosis detection is essential for cancer diagnosis and treatment. Traditional manual counting by pathologists is time-consuming and error-prone. This research investigates automated mitosis detection in stained histopathological images using Deep Learning (DL) techniques, particularly object detection models. We propose a two-stage object detection model based on Faster R-CNN to detect mitotic figures in histopathological images. Stain augmentation and normalization techniques are also applied to address the significant challenge of domain shift in histopathological image analysis. The experiments are conducted on the MIDOG++ dataset, the most recent dataset from the MIDOG challenge. This research builds on our previous work, which proposed two one-stage frameworks based on RetinaNet using fastai and PyTorch. Our results show favorable F1-scores across various scenarios and tumor types, demonstrating the effectiveness of the object detection models. In addition, Faster R-CNN with stain techniques provides the most accurate and reliable mitosis detection, while the RetinaNet models run faster. Our results highlight the importance of handling domain shift and the number of mitotic figures for robust diagnostic tools.
Title: Automated mitosis detection in stained histopathological images using Faster R-CNN and stain techniques.
Journal: Journal of Integrative Bioinformatics
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569583/pdf/
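The F1-score used to compare the detectors in the abstract above can be illustrated with a minimal sketch. It assumes that detections have already been matched to ground-truth mitotic figures, so that only the counts of true positives, false positives, and false negatives remain; the function name `detection_f1` and the matching step are illustrative assumptions, not taken from the paper.

```python
def detection_f1(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall) from matched counts.

    Hypothetical helper: the counts are assumed to come from a prior step that
    matches predicted boxes to annotated mitotic figures.
    """
    if true_positives == 0:
        # No correct detections: precision and recall are both zero (or undefined).
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Example: 8 correct detections, 2 spurious detections, 2 missed figures.
    print(detection_f1(8, 2, 2))
```

Because F1 ignores true negatives, it suits detection tasks where the background (non-mitotic tissue) vastly outnumbers the objects of interest.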
Pub Date: 2025-06-10; eCollection Date: 2025-06-01; DOI: 10.1515/jib-2024-0056
Anushka Chaurasia, Deepak Kumar, Yogita
Predicting drug-drug interaction (DDI)-induced adverse drug reactions (ADRs) using computational methods is challenging due to limited data samples, data sparsity, and high dimensionality. Class imbalance further complicates prediction. Various computational techniques have been presented for predicting DDI-induced ADRs in the general population; however, DDI-induced pregnancy and neonatal ADRs remain underexplored. In the present work, a sparse ensemble-based computational approach is proposed that leverages SMILES strings as features, addresses high-dimensional and sparse data using Sparse Principal Component Analysis (SPCA), mitigates class imbalance with the Multilabel Synthetic Minority Oversampling Technique (MLSMOTE), and predicts pregnancy and neonatal ADRs through a stacking ensemble model. SPCA was evaluated for handling sparse data and showed a 2.67 %-5.45 % improvement over PCA. The proposed stacking ensemble model outperformed six state-of-the-art predictors by 1.16 %-14.94 % on micro and macro scores for True Positive Rate (TPR), F1 Score, False Positive Rate (FPR), Precision, Hamming Loss, and ROC-AUC Score.
Title: Predicting DDI-induced pregnancy and neonatal ADRs using sparse PCA and stacking ensemble approach.
Journal: Journal of Integrative Bioinformatics
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569586/pdf/
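The abstract above reports micro and macro scores and Hamming Loss for a multilabel task. As a rough illustration of how these differ, the sketch below computes Hamming loss and micro- and macro-averaged F1 on binary label matrices given as lists of rows. The function names and the toy data are assumptions for illustration only and do not reproduce the paper's evaluation code.

```python
from typing import List, Sequence, Tuple


def hamming_loss(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label assignments that are wrong."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / total


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: Sequence[Sequence[int]],
                   y_pred: Sequence[Sequence[int]]) -> Tuple[float, float]:
    """Return (micro-F1, macro-F1) over the label columns."""
    n_labels = len(y_true[0])
    counts: List[Tuple[int, int, int]] = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Macro: average the per-label scores, so rare labels weigh as much as common ones.
    macro = sum(_f1(*c) for c in counts) / n_labels
    # Micro: pool the counts first, so frequent labels dominate.
    micro = _f1(sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts))
    return micro, macro


if __name__ == "__main__":
    # Toy data: label 0 is common and predicted well, label 1 is rare and predicted badly.
    truth = [[1, 0], [1, 0], [1, 1], [0, 0]]
    preds = [[1, 0], [1, 0], [1, 0], [0, 1]]
    print(hamming_loss(truth, preds), micro_macro_f1(truth, preds))
```

On imbalanced label sets the macro score drops sharply when a rare label is predicted poorly, which is why reporting both averages is informative.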
Pub Date: 2025-06-04; eCollection Date: 2025-06-01; DOI: 10.1515/jib-2024-0048
Salvador de Haro, Gregorio Bernabé, José Manuel García, Pilar González-Férez
Left ventricular non-compaction is a cardiac condition marked by excessive trabeculae in the left ventricle's inner wall. Although various methods exist to measure these structures, the medical community still lacks consensus on the best approach. Previously, we developed DL-LVTQ, a tool based on a U-Net neural network, to quantify trabeculae in this region. In this study, we expand the dataset to include new patients with Titin cardiomyopathy and healthy individuals with fewer trabeculae, which requires retraining our models to enhance predictions. We also propose ViTUNeT, a neural network architecture combining U-Net and Vision Transformers to segment the left ventricle more accurately. Additionally, we train a YOLOv8 model to detect the ventricle and integrate it with the ViTUNeT model to focus on the region of interest. Results from ViTUNeT and YOLOv8 are similar to those of DL-LVTQ, suggesting that dataset quality limits further accuracy improvements. To test this, we analyze MRI images and develop a method using two YOLOv8 models to identify and remove problematic images, leading to better results. Combining YOLOv8 with deep learning networks offers a promising approach for improving cardiac image analysis and segmentation.
Title: A ViTUNeT-based model using YOLOv8 for efficient LVNC diagnosis and automatic cleaning of dataset.
Journal: Journal of Integrative Bioinformatics
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569573/pdf/
Pub Date: 2025-06-03; eCollection Date: 2025-03-01; DOI: 10.1515/jib-2024-0043
Vladislav V Shilenok, Irina V Shilenok, Vladislav O Soldatov, Yuriy L Orlov, Ksenia A Kobzeva, Alexey V Deykin, Olga Yu Bushueva
Although multiple aspects of the molecular pathology underlying cardiovascular diseases (CVDs) have been revealed, the complete picture has yet to be elucidated. In this respect, annotation of novel links between genes and atherosclerosis is of great importance for cardiovascular medicine. In line with our previous research, we aimed to analyze the contribution of the genes encoding Hero-proteins, polypeptides with chaperone activity, to cardiovascular predisposition. The following bioinformatic resources were used to annotate data on the cardiovascular contribution of Hero-proteins and their genes: SNPinfo Web Server, The Cardiovascular Disease Knowledge Portal, GTEx Portal, HaploReg, rSNPBase, RegulomeDB, atSNP, Gene Ontology, QTLbase, and the Blood eQTL browser. Almost all analyzed genes (except BEX3) were characterized by a very high regulatory potential of tag SNPs. We found multiple substantial impacts of the analyzed SNPs on histone modifications, eQTL effects on CVD-related genes, and binding to transcription factors involved in biological processes pathogenetically significant for CVDs. Here we provide in silico evidence of the involvement of genes C9orf16 (BBLN), C11orf58, SERBP1, SERF2, and C19orf53 in CVDs and their risk factors (high blood pressure, dyslipidemia, obesity, arrhythmias, etc.), thus revealing Hero-proteins as putative actors in the pathobiology of the heart and vessels.
Title: Bioinformatic analysis of the regulatory potential of tagging SNPs provides evidence of the involvement of genes encoding the heat-resistant obscure (Hero) proteins in the pathogenesis of cardiovascular diseases.
Journal: Journal of Integrative Bioinformatics
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12327200/pdf/
Pub Date: 2025-06-03; eCollection Date: 2025-06-01; DOI: 10.1515/jib-2024-0053
Jose María Belmonte, Miguel Blanquer, Gregorio Bernabé, Fernando Jiménez, José Manuel García
This paper investigates the application of Survival Analysis (SA) techniques to forecast outcomes after autologous Hematopoietic Stem Cell Transplantation (aHSCT) for Multiple Myeloma (MM). Using six SA models, we examine their predictive capabilities, measured with the Concordance Index (C-index). Beyond evaluating model performance, we analyze feature importance using permutation and SHAP methods, highlighting key clinical factors such as treatment history, disease stage, and prior disease progression or relapse as critical predictors of survival. The findings suggest that while all models performed well on the C-index, a detailed examination revealed variations in how each model processed the data. Specifically, the Coxnet and Random Survival Forest models made more thorough use of the clinical variables, whereas the gradient boosting models appeared to rely on a narrower range of features, potentially limiting their ability to differentiate between patients with comparable profiles. Risk predictions categorized patients into low-, moderate-, and high-risk levels. For lower-risk patients, the procedure showed positive outcomes, while higher-risk individuals were predicted to have limited survival benefits, suggesting that alternative treatments should be considered for them. Lastly, we propose future research to extend these models to time-to-event estimation, offering additional support for decision-making by predicting patient life expectancy post-transplant from pre-transplant clinical attributes.
Title: Survival risk prediction in hematopoietic stem cell transplantation for multiple myeloma.
Journal: Journal of Integrative Bioinformatics
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569572/pdf/
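The Concordance Index used in the abstract above to compare survival models can be sketched in a few lines. The version below is a simplified form of Harrell's C-index: a pair of patients is comparable when the one with the shorter observed time actually had the event, and it counts as concordant when that patient also received the higher predicted risk. The function name and the handling of tied times (ignored here) are simplifying assumptions, not the paper's implementation.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risk_scores: Sequence[float]) -> float:
    """Simplified Harrell's C-index.

    times: observed follow-up time per patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    risk_scores: model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Pair (i, j) is comparable only if i had the event strictly before j's time.
            # Pairs with tied times are skipped in this simplified sketch.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


if __name__ == "__main__":
    # Three patients; the last one is censored at time 15.
    print(concordance_index([5, 10, 15], [1, 1, 0], [0.9, 0.5, 0.1]))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is why models with similar C-index values can still differ in how they use individual features.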
Pub Date: 2025-05-30; eCollection Date: 2025-03-01; DOI: 10.1515/jib-2025-0012
Danuta Schüler, Matthias Lange, Thomas Altmann, Maria Cuacos, Daniel Arend, John Charles D'Auria, Anne Fiebig, Jochen Kumlehn, Kerstin Neumann, Michael Melzer, Elena Rey-Mazón, Hardy Rolletschek, Uwe Scholz, Evelin Willner, Jochen C Reif
The Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben is a leading international plant science institute specializing in biodiversity and crop plant performance research. Over the last decade, all phases of the research data lifecycle were implemented as a continuous process in conjunction with information technology, standardization, and sustainable research data management (RDM) processes. Under the leadership of a team of data stewards, a research data infrastructure, process landscape, capacity building, and governance structures were successfully established. As a result, a generic research data infrastructure was created to serve the principles of good scientific practice, archiving research data in an accessible and sustainable manner, even before the FAIR criteria were formulated. In this paper, we discuss success stories as well as pitfalls and summarize the experiences from 15 years of operating a central RDM infrastructure. We present measures for agile requirements engineering, technical and organizational implementation, governance, training, and roll-out. We show the benefits of a participatory approach across all departments, personnel roles, and researcher profiles through pilot working groups and data management champions. As a result, an ambidextrous approach to data management was implemented, referring to the ability to efficiently combine operational needs, support daily tasks in compliance with the FAIR criteria, while remaining open to adopting technical innovations in an agile manner.
Title: Data management in balance - a decade of balancing pragmatism, sustainability and innovation at plant research center IPK Gatersleben.
Journal: Journal of Integrative Bioinformatics
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12327199/pdf/
Pub Date: 2025-04-02; eCollection Date: 2025-06-01; DOI: 10.1515/jib-2024-0054
Alba Nogueira-Rodríguez, Daniel Glez-Peña, Cristina P Vieira, Jorge Vieira, Hugo López-Fernández
The characterization of protein-protein interactions (PPIs) is fundamental to understanding cellular functions. Although machine learning methods for this task, including those using only raw protein sequences, have historically reported prediction accuracies of up to 95 %, it has been pointed out that these figures may be overestimated because of random splits and metrics that do not account for potential biases in the datasets. Here, we propose a per-protein utility metric, pp_MCC, that reveals a drop in performance in both random and unseen-protein split scenarios. We tested ML models based on sequence embeddings. The pp_MCC metric shows reduced performance even in a random split, reaching levels similar to those of the raw MCC metric computed over an unseen-protein split, and drops even further when pp_MCC is used in an unseen-protein split scenario. Thus, the metric gives a more realistic performance estimate while still allowing random splits to be used, which could be of interest for more protein-centric studies. Given the low adjusted performance obtained, there seems to be room for improvement when using only primary sequence information, suggesting the need to include complementary protein data, together with the use of the pp_MCC metric.
Title: Towards a more accurate and reliable evaluation of machine learning protein-protein interaction prediction model performance in the presence of unavoidable dataset biases.
Journal: Journal of Integrative Bioinformatics
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569588/pdf/
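The abstract above does not spell out how pp_MCC is computed, so the following is only one hypothetical reading of a "per-protein" Matthews correlation coefficient: compute the MCC separately over the pairs that involve each protein, then average across proteins. The function names `mcc` and `per_protein_mcc`, the averaging rule, and the convention of scoring 0.0 when the MCC is undefined are all assumptions for illustration; the paper's actual definition may differ.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (0.0 when undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def per_protein_mcc(pairs: Sequence[Tuple[str, str]],
                    y_true: Sequence[int],
                    y_pred: Sequence[int]) -> float:
    """Hypothetical per-protein score: mean of the MCC computed for each protein's pairs."""
    by_protein: Dict[str, List[int]] = defaultdict(list)
    for idx, (a, b) in enumerate(pairs):
        by_protein[a].append(idx)
        if b != a:
            by_protein[b].append(idx)
    scores = [
        mcc([y_true[i] for i in idxs], [y_pred[i] for i in idxs])
        for idxs in by_protein.values()
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: protein "A" appears in most pairs and is predicted well.
    pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
    truth = [1, 0, 1, 0]
    preds = [1, 0, 0, 0]
    print("pooled MCC:", mcc(truth, preds))
    print("per-protein MCC:", per_protein_mcc(pairs, truth, preds))
```

In this toy case the pooled score is carried by the frequently occurring protein, while the per-protein average is pulled down by proteins with few or poorly predicted pairs, which is the kind of bias the abstract says a pooled metric can hide.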
Pub Date: 2025-03-31; eCollection Date: 2025-06-01; DOI: 10.1515/jib-2024-0046
Daniel Pérez-Rodríguez, Roberto C Agís-Balboa, Hugo López-Fernández
Family structures play a crucial role in personal development, social dynamics, and mental health. Traditional systems for encoding genealogical data, such as Ahnentafel and the Register System, offer methods to document lineage but face limitations, particularly in accommodating horizontal relationships or handling changes in family datasets. Modern computational systems like LINKAGE and PED, while powerful for genetic analysis, lack human readability and are challenging to apply in fields where unstructured, narrative data is common, such as sociology or psychiatry. This paper aims to bridge this gap by enhancing Fcodes, a flexible and intuitive algorithm for encoding kinship relationships that is suited for both manual and computational use. Building on our previous work, we present improvements to the Fcodes core algorithm and command-line interface (CLI), as well as the development of F-Tree, a new graphical user interface (GUI) to streamline the encoding process. Additionally, we introduce a method for estimating the coefficient of inbreeding using Fcodes and explore the application of artificial intelligence, namely large language models (LLMs), to automatically infer family relationships from narrative text. These advancements highlight the potential of Fcodes in a wide range of research contexts, from social studies to genetics and mental health research.
Title: Fcodes update: a kinship encoding framework with F-Tree GUI & LLM inference.
Journal: Journal of Integrative Bioinformatics
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569577/pdf/
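The abstract above mentions estimating the coefficient of inbreeding from Fcodes but does not describe the method. For orientation, the sketch below uses the standard pedigree recursion instead: an individual's inbreeding coefficient equals the kinship coefficient of its parents. It works on a plain parent table rather than on Fcodes, and the `Pedigree` type, the function name, and the requirement that both parents are either known or unknown are assumptions of this sketch.

```python
from functools import lru_cache
from typing import Dict, Optional, Tuple

# Maps each individual to (father, mother); founders map to (None, None).
Pedigree = Dict[str, Tuple[Optional[str], Optional[str]]]


def inbreeding_coefficient(pedigree: Pedigree, individual: str) -> float:
    """Inbreeding coefficient F via the classical kinship recursion (not the Fcodes method)."""

    @lru_cache(maxsize=None)
    def depth(person: str) -> int:
        # Generation depth: founders are 0, otherwise one more than the deepest parent.
        father, mother = pedigree[person]
        if father is None or mother is None:
            return 0
        return 1 + max(depth(father), depth(mother))

    @lru_cache(maxsize=None)
    def kinship(a: str, b: str) -> float:
        if a == b:
            father, mother = pedigree[a]
            if father is None or mother is None:
                return 0.5  # a non-inbred founder
            return 0.5 * (1.0 + kinship(father, mother))
        # Always expand the deeper individual, which cannot be an ancestor of the other.
        if depth(a) < depth(b):
            a, b = b, a
        father, mother = pedigree[a]
        if father is None or mother is None:
            return 0.0  # two distinct founders are assumed unrelated
        return 0.5 * (kinship(father, b) + kinship(mother, b))

    father, mother = pedigree[individual]
    if father is None or mother is None:
        return 0.0
    return kinship(father, mother)


if __name__ == "__main__":
    # Offspring "X" of first cousins "C" and "D".
    first_cousins: Pedigree = {
        "GP1": (None, None), "GP2": (None, None),
        "S1": (None, None), "S2": (None, None),
        "A": ("GP1", "GP2"), "B": ("GP1", "GP2"),
        "C": ("S1", "A"), "D": ("B", "S2"),
        "X": ("C", "D"),
    }
    print(inbreeding_coefficient(first_cousins, "X"))
```

The first-cousin example is a convenient sanity check because its textbook value is 1/16.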
Pub Date: 2025-03-20; eCollection Date: 2025-06-01; DOI: 10.1515/jib-2024-0050
João Capela, João Cheixo, Dick de Ridder, Oscar Dias, Miguel Rocha
Plants produce specialized metabolites, which play critical roles in defending against biotic and abiotic stresses. Due to their chemical diversity and bioactivity, these compounds have significant economic implications, particularly in the pharmaceutical and agrotechnology sectors. Despite their importance, the biosynthetic pathways of these metabolites remain largely unresolved. Automating the prediction of their precursors, derived from primary metabolism, is essential for accelerating pathway discovery. Using DeepMol's automated machine learning engine, we found that regularized linear classifiers offer optimal, accurate, and interpretable models for this task, outperforming state-of-the-art models while providing chemical insights into their predictions. The pipeline and models are available at the repository: https://github.com/jcapels/SMPrecursorPredictor.
{"title":"Predicting precursors of plant specialized metabolites using DeepMol automated machine learning.","authors":"João Capela, João Cheixo, Dick de Ridder, Oscar Dias, Miguel Rocha","doi":"10.1515/jib-2024-0050","DOIUrl":"10.1515/jib-2024-0050","url":null,"PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12569576/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143658772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-03-19; eCollection Date: 2025-03-01; DOI: 10.1515/jib-2023-0053
Zahra Mosalanejad, Seyed Nooreddin Faraji, Mohammad Reza Rahbar, Ahmad Gholami
Glycoprotein 41 (gp41) of the human immunodeficiency virus (HIV), located on the virus's external surface, forms six-helix bundles that facilitate viral entry into the host cell. Theta-defensins, which are cyclic peptides, inhibit the formation of these bundles by binding to the C-terminal heptad repeat (CHR) region of gp41. RC101, a synthetic analog of theta-defensin, is active against various HIV subtypes. Molecular docking of the CHR and RC101 was performed using the MDockPeP and HawkDock servers. The bond types and the amino acids essential for binding were identified using AlphaFold3, Chimera, RING, and Cytoscape. Mutable amino acids within the peptide were determined using CUPSAT and DUET. Thirty-two new peptides were designed, and their interactions with the CHR of gp41 were analyzed. The physicochemical properties, toxicity, allergenicity, and antigenicity of the peptides were also investigated. Most of the designed peptides exhibited higher binding affinities to the target than RC101; notably, peptides 1 and 4 had the highest binding affinity and a greater percentage of interactions with critical amino acids of the CHR. Peptides A and E displayed the best physicochemical properties among the designed peptides. The designed peptides may represent a new generation of anti-HIV drugs that could reduce the likelihood of drug resistance.
{"title":"Designing an optimized theta-defensin peptide for HIV therapy using in-silico approaches.","authors":"Zahra Mosalanejad, Seyed Nooreddin Faraji, Mohammad Reza Rahbar, Ahmad Gholami","doi":"10.1515/jib-2023-0053","DOIUrl":"10.1515/jib-2023-0053","url":null,"PeriodicalId":53625,"journal":{"name":"Journal of Integrative Bioinformatics","volume":" ","pages":""},"PeriodicalIF":1.8,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12327201/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143651943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}