Background: Plant-derived exosome-like nanovesicles (P-ELNs) effectively deliver bioactive compounds due to their high biocompatibility and low immunogenicity. While liquid chromatography-mass spectrometry (LC-MS) profiles compounds in complex samples, analysis of the large datasets it produces remains limited by traditional methods. Recent advances in large language models (LLMs) and domain-specific systems have enhanced Chinese biomedical data processing and cross-modal pharmaceutical research.
Objective: This study aimed to create a multimodal framework of LC-MS combined with DeepSeek models for data mining of compounds with wound-healing properties from exosome-like nanovesicles derived from Cayratia japonica (CJ-ELNs).
Methods: LC-MS identified compounds enriched in CJ (n=3) and CJ-ELNs (n=3), and then compounds specifically enriched in CJ-ELNs were filtered via a four-step filtering workflow. The CJ-ELNs-specific compounds were processed by DeepSeek models for screening naturally active compounds with targeted functions of antioxidation, anti-inflammation, anticellular damage, antiapoptosis, wound healing and tissue regeneration, and cell proliferation.
Results: A multimodal framework of LC-MS combined with the DeepSeek-DF model was created. With the assistance of artificial intelligence (AI), a total of 46 naturally active compounds derived from CJ-ELNs with targeted functions were identified.
Conclusions: A self-designed multimodal framework of LC-MS, combined with DeepSeek models, rapidly and accurately identifies naturally active compounds from CJ-ELNs. This AI-powered system innovatively integrates the traditional analytical technique with modern LLMs, thus greatly favoring data mining of active ingredients in traditional Chinese medicine herbs.
Background: Manual abstraction of unstructured clinical data is often necessary for granular clinical outcomes research but is time-consuming and can be of variable quality. Large language models (LLMs) show promise in medical data extraction, yet integrating them into research workflows remains challenging and poorly described.
Objective: This study aimed to develop and integrate an LLM-based system for automated data extraction from unstructured electronic health record (EHR) text reports within an established clinical outcomes database.
Methods: We implemented a generative artificial intelligence pipeline (UODBLLM) using a flexible language model interface that supports various LLM implementations, including Health Insurance Portability and Accountability Act-compliant cloud services and local open-source models. We used extensible markup language (XML)-structured prompts and connected the pipeline to the database through an open database connectivity interface to generate structured data from clinical documentation in the EHR. We evaluated UODBLLM's performance on completion rate, processing time, and extraction capabilities across multiple clinical data elements, including quantitative measurements, categorical assessments, and anatomical descriptions, using sample magnetic resonance imaging (MRI) reports as test cases. System reliability was tested across multiple batches to assess scalability and consistency.
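As a rough illustration of the XML-structured prompting and schema checking that this abstract describes, the following Python sketch wraps a report in XML tags and checks a model reply against a small schema. The field names, the `SCHEMA` dictionary, and the `build_prompt` and `validate` helpers are all hypothetical and are not taken from UODBLLM; no LLM is called here.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical schema (assumption): field name -> accepted Python type(s).
SCHEMA = {
    "prostate_volume_ml": (int, float),
    "pirads_score": int,
}


def build_prompt(report_text: str) -> str:
    """Wrap the instructions, requested fields, and report text in XML tags."""
    root = ET.Element("prompt")
    ET.SubElement(root, "instructions").text = (
        "Extract the listed fields and answer with JSON only."
    )
    fields = ET.SubElement(root, "fields")
    for name in SCHEMA:
        ET.SubElement(fields, "field").text = name
    # ElementTree escapes characters such as & and < in the report text.
    ET.SubElement(root, "report").text = report_text
    return ET.tostring(root, encoding="unicode")


def validate(raw_json: str) -> dict:
    """Parse a model reply and check each field against SCHEMA."""
    data = json.loads(raw_json)
    for name, expected in SCHEMA.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {name}")
    return data
```

A reply that passes `validate` could then be written to the database through whichever connection layer is in use; that step is omitted because the abstract does not describe it in detail.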
Results: Piloted against MRI reports, UODBLLM processed 1800 clinical documents with a 100% completion rate and an average processing time of 8.90 seconds per report. The token utilization averaged 2692 tokens per report, with an input-to-output ratio of approximately 13:2, resulting in a processing cost of US $0.009 per report. UODBLLM had consistent performance across 18 batches of 100 reports each and completed all processing in 4.45 hours. From each report, UODBLLM extracted 16 structured clinical elements, including prostate volume, prostate-specific antigen values, Prostate Imaging Reporting and Data System scores, clinical staging, and anatomical assessments. All extracted data were automatically validated against predefined schemas and stored in standardized JSON format.
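The reported throughput figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated above. The input and output token split and the total cost are derived values, not figures reported by the authors, and the split is approximate because the 13:2 ratio is itself given as approximate.

```python
# Figures as reported in the abstract.
reports = 1800
seconds_per_report = 8.90
tokens_per_report = 2692
input_parts, output_parts = 13, 2  # approximate input-to-output ratio
cost_per_report_usd = 0.009

# Derived values (not reported directly, except the total time).
total_hours = reports * seconds_per_report / 3600
input_tokens = tokens_per_report * input_parts / (input_parts + output_parts)
output_tokens = tokens_per_report - input_tokens
total_cost_usd = reports * cost_per_report_usd

print(f"total time: {total_hours:.2f} h")             # 4.45 h, as reported
print(f"input tokens per report: ~{input_tokens:.0f}")
print(f"output tokens per report: ~{output_tokens:.0f}")
print(f"total cost: US ${total_cost_usd:.2f}")
```

The total time agrees with the 4.45 hours stated in the abstract, which suggests the reports were processed sequentially.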
Conclusions: We demonstrated the successful integration of an LLM-based extraction system within an existing clinical outcomes database, achieving rapid, comprehensive data extraction at minimal cost. UODBLLM provides a scalable, efficient solution for automating clinical data extraction while maintaining protected health information security. This approach could significantly accelerate research timelines and expand feasible clinical studies, particularly for large-scale database projects.
Background: The COVID-19 pandemic requires a deep understanding of SARS-CoV-2, particularly how mutations in the spike receptor-binding domain (RBD) chain E affect its structure and function. Current methods lack comprehensive analysis of these mutations at different structural levels.
Objective: This study aims to analyze the impact of specific COVID-19-associated point mutations (N501Y, L452R, N440K, K417N, and E484A) on the SARS-CoV-2 spike RBD structure and function using predictive modeling, including a graph-theoretic model, protein modeling techniques, and molecular dynamics simulations.
Methods: The study used a multitiered graph-theoretic framework to represent protein structure across 3 interconnected levels. This model incorporated 19 top-level vertices, connected to intermediate graphs based on 6-angstrom proximity within the protein's 3D structure. Graph-theoretic molecular descriptors or invariants were applied to weigh vertices and edges at all levels. The study also used Iterative Threading Assembly Refinement (I-TASSER) to model mutated sequences and molecular dynamics simulation tools to evaluate changes in protein folding and stability compared to the wildtype.
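The 6-angstrom proximity rule can be illustrated with a minimal contact-graph sketch in Python. This shows only the distance-based edge rule and one simple invariant (vertex degree). It is not the authors' three-level model, and the vertex labels and coordinates are hypothetical toy values rather than data from a real structure.

```python
import math
from itertools import combinations

CUTOFF_ANGSTROM = 6.0  # proximity threshold stated in the abstract


def contact_graph(coords):
    """Build adjacency sets: two vertices are linked when their
    representative points lie within CUTOFF_ANGSTROM of each other."""
    graph = {name: set() for name in coords}
    for (a, pa), (b, pb) in combinations(coords.items(), 2):
        if math.dist(pa, pb) <= CUTOFF_ANGSTROM:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def degrees(graph):
    """Vertex degree, one of the simplest graph invariants."""
    return {name: len(neighbours) for name, neighbours in graph.items()}


# Toy coordinates in angstroms (hypothetical, not from a PDB entry).
toy = {
    "R1": (0.0, 0.0, 0.0),
    "R2": (3.0, 4.0, 0.0),   # 5.0 angstroms from R1, so linked to R1
    "R3": (12.0, 0.0, 0.0),  # more than 6 angstroms from both others
}
print(degrees(contact_graph(toy)))  # {'R1': 1, 'R2': 1, 'R3': 0}
```

Comparing such invariants between a wildtype graph and a mutant graph is one plausible way to flag structural change, although the abstract does not specify which descriptors were used.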
Results: A total of 3 distinct predictive modeling and analytical approaches successfully identified structural and functional changes in the SARS-CoV-2 spike RBD (chain E) resulting from point mutations. The novel graph-theoretic model detected notable structural changes, with N501Y and L452R showing the most pronounced effects on conformation and stability compared to the wildtype. The K417N and E484A mutations had less pronounced effects than N501Y and L452R. Ab initio modeling and molecular dynamics simulation findings corroborated the results of the graph-theoretic analysis. The multilevel analytical approach provided a comprehensive visualization of mutation effects, deepening our understanding of their functional consequences.
Conclusions: This study advanced our understanding of SARS-CoV-2 spike RBD mutations and their implications. The multifaceted approach characterized the effects of various mutations, identifying N501Y and L452R as having the most substantial impact on RBD conformation and stability. The findings have important implications for vaccine development, therapeutic design, and variant monitoring. Our research underscores the power of combining multiple predictive analytical approaches in virology, contributing valuable knowledge to ongoing efforts against the COVID-19 pandemic and providing a framework for future studies on viral mutations and their impacts on protein structure and function.
Background: Adalimumab, a monoclonal antibody targeting tumor necrosis factor α, treats autoimmune diseases but induces antidrug antibodies in 30% to 60% of patients, reducing its efficacy.
Objective: This study aims to investigate molecular mimicry as a mechanism behind this immunogenicity, where bacterial immunoglobulin domains structurally resemble adalimumab's light chain, triggering immune responses.
Methods: Using PSI-BLASTp (National Center for Biotechnology Information) and PRALINE (Center for Integrative Bioinformatics), we identified 40 bacterial antigens homologous to adalimumab, including 8 from clinically relevant strains.
Results: Structural analysis revealed 94% amino acid identity between the immunoglobulin domain of Escherichia coli strain B1 and adalimumab's light chain, and 89.67% similarity with Corynebacterium pyruviciproducens. Root mean square deviation values confirmed strong structural homology. Additionally, 5 cross-reactive B-cell epitopes were predicted, suggesting overlapping surfaces that may promote immune cross-reactivity and antidrug antibody development.
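For readers unfamiliar with how an amino acid identity percentage is obtained, the sketch below computes it from two already-aligned sequences. It assumes one common convention (matches divided by the number of ungapped columns). Alignment tools differ in the denominator they use, so this is not necessarily the calculation behind the values reported above, and the helper name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned, non-gap columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned sequences (hypothetical): 3 of 4 columns match.
print(percent_identity("ACDE", "ACDF"))  # 75.0
```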
Conclusions: This study represents a first step toward identifying a potential microbial factor driving antiadalimumab antibody formation. The predicted cross-reactive regions provide specific candidates for further in vitro validation to confirm molecular mimicry and refine epitope mapping. Understanding these mechanisms may ultimately inform the design of less immunogenic biologics and guide clinical strategies to predict and prevent antidrug antibody formation.
Background: Bladder cancer is a disease characterized by complex perturbations in gene networks and is heterogeneous in terms of histology, mutations, and prognosis. Advances in high-throughput sequencing technologies, genome-wide association studies, and bioinformatics methods have revealed greater insights into the pathogenesis of complex diseases. Network biology-based approaches have been used to identify complex protein-protein interactions (PPIs) that can lead to potential drug targets. There is a need to better understand PPIs specific to urothelial carcinoma.
Objective: This study aimed to elucidate PPIs specific to papillary and nonpapillary urothelial carcinoma and identify the most connected or "hub" proteins, as these are potential drug targets.
Methods: A novel PPI analysis tool, Proteinarium, was used to analyze RNA sequencing data from 132 patients with papillary and 270 patients with nonpapillary urothelial carcinoma from the TCGA Cell 2017 dataset and 39 patients with papillary and 88 patients with nonpapillary urothelial carcinoma from the TCGA Nature 2014 dataset. Hub proteins were identified in distinct PPI networks specific to papillary and nonpapillary urothelial carcinoma. Statistical significance of clusters was assessed using the Fisher exact test (P<.001), and network separation was quantified using the interactome-based separation score.
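The idea of a "hub" protein can be made concrete with a short degree-ranking sketch in Python. This is a generic illustration under the assumption that hubs are ranked by number of distinct interaction partners. It does not reproduce Proteinarium's clustering or scoring, and the edge list is a hypothetical toy network.

```python
from collections import Counter


def hub_proteins(edges, top_n=3):
    """Rank proteins by degree (number of distinct interaction partners).

    Duplicate and reversed edges are counted once, and self-loops are ignored.
    Ties are broken alphabetically so the result is deterministic.
    """
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    degree = Counter(protein for edge in unique_edges for protein in edge)
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy interaction list (hypothetical protein labels).
toy_edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"),
    ("P3", "P1"),  # same interaction as P1-P3, listed in reverse
]
print(hub_proteins(toy_edges, top_n=2))  # [('P1', 3), ('P2', 2)]
```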
Results: RPS27A, UBA52, and VAMP8 were the most connected or "hub" proteins identified in the network specific to papillary urothelial carcinoma. In the network specific to nonpapillary urothelial carcinoma, GNB1, RHOA, UBC, and FPR2 were found to be the hub proteins. Notably, GNB1 and FPR2 were among the proteins targeted by existing drugs.
Conclusions: We identified distinct PPI networks and the hub proteins specific to papillary and nonpapillary urothelial carcinomas. However, these findings are limited by the use of transcriptomic data and require experimental validation to confirm the functional relevance of the identified targets.
Background: Integrating clinical, genomic, and social determinants of health (SDOH) data is essential for advancing precision medicine and addressing cancer health disparities. However, existing bioinformatics tools often lack the flexibility to perform equity-driven analyses or require significant programming expertise.
Objective: We developed AI-HOPE-PM (Artificial Intelligence Agent for High-Optimization and Precision Medicine in Population Metrics), a conversational artificial intelligence system designed to enable natural language-driven, multidimensional cancer analysis. This study describes the development, implementation, and application of AI-HOPE-PM to support hypothesis testing that integrates genomic, clinical, and SDOH data.
Methods: AI-HOPE-PM leverages large language models and Python-based statistical scripts to convert user-defined natural language queries into executable workflows. It was evaluated using curated colorectal cancer datasets from The Cancer Genome Atlas and cBioPortal, enriched with harmonized SDOH variables. Accuracy of natural language interpretation, run time efficiency, and usability were benchmarked against cBioPortal and UCSC Xena.
Results: AI-HOPE-PM successfully supported case-control stratification, survival modeling, and odds ratio analysis using natural language prompts. In colorectal cancer case studies, the system revealed significant disparities in progression-free survival and treatment access based on financial strain, health care access, food insecurity, and social support, demonstrating the importance of integrating SDOH in cancer research. Benchmark testing showed faster task execution compared to existing platforms, and the system achieved 92.5% accuracy in parsing biomedical queries.
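The odds ratio analysis mentioned above can be sketched for a single 2x2 table as follows. This is a textbook estimate with a Wald confidence interval, offered as an assumption about what such an analysis might compute. It is not AI-HOPE-PM's code, and the counts in the example are made up.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval for a 2x2 table.

                   outcome   no outcome
    exposed           a          b
    unexposed         c          d

    z=1.96 gives an approximate 95% interval.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this estimate")
    estimate = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * standard_error)
    upper = math.exp(math.log(estimate) + z * standard_error)
    return estimate, lower, upper


# Made-up counts: 20 of 100 exposed and 10 of 100 unexposed had the outcome.
estimate, lower, upper = odds_ratio(20, 80, 10, 90)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Tables with a zero cell need a different approach, such as a continuity correction or an exact method, which is why the sketch rejects them rather than guessing.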
Conclusions: AI-HOPE-PM lowers technical barriers to integrative cancer research by enabling real-time, user-friendly exploration of clinical, genomic, and SDOH data. It expands on prior work by incorporating equity metrics into precision oncology workflows and offers a scalable tool for supporting disparities-focused translational research. Five videos are included as multimedia appendices to demonstrate platform functionality in real-world scenarios.

