Preparing to rapidly respond to emerging infectious diseases is becoming ever more critical. "SpillOver: Viral Risk Ranking" is an open-source tool developed to evaluate novel wildlife-origin viruses for their risk of spillover from animals to humans and their risk of spreading in human populations. However, several of the factors used in the risk assessment depend on evidence of previous zoonotic spillover and/or sustained transmission in humans. We therefore performed a reanalysis of the "Ranking Comparison" after removing eight factors that require post-spillover knowledge and compared the adjusted risk rankings to the originals. The top 10 viruses as ranked by their adjusted scores also had very high original scores. However, the tool's power to predict whether a virus was classified as a human virus in the SpillOver database deteriorated when these eight factors were removed: the area under the receiver operating characteristic curve (AUROC) decreased from 0.94 for the original scores to 0.73 for the adjusted scores. Furthermore, we compared the mean and standard deviation of the human and non-human viruses at the factor level. Most of the excluded spillover-dependent factors had dissimilar means between the human and non-human virus groups, whereas the non-spillover-dependent factors, with some exceptions, had similar means across the two groups. We concluded that the original formulation of the tool depended heavily on spillover-dependent factors to "predict" the risk of zoonotic spillover for a novel virus. Future iterations of the tool should take other non-spillover-dependent factors into consideration and omit those that are spillover-dependent to ensure the tool is fit for purpose.
"Focusing Viral Risk Ranking Tool on Prediction" — Katherine Budeski, Marc Lipsitch. arXiv:2409.04932 (arXiv - QuanBio - Quantitative Methods), 2024-09-07.
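The AUROC comparison above has a simple probabilistic reading: it is the probability that a randomly chosen human virus outscores a randomly chosen non-human one. A minimal sketch of that computation, using made-up scores and labels rather than the SpillOver data:

```python
def auroc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the fraction of (positive, negative) pairs in which the positive
    example receives the higher score (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A score that perfectly separates human (label 1) from non-human (label 0) viruses yields an AUROC of 1.0; a score carrying no information yields 0.5.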
In the realm of drug discovery, DNA-encoded library (DEL) screening technology has emerged as an efficient method for identifying high-affinity compounds. However, DEL screening faces a significant challenge: noise arising from nonspecific interactions within complex biological systems. Neural networks trained on DEL libraries have been employed to extract compound features, aiming to denoise the data and uncover potential binders to the desired therapeutic target. Nevertheless, the inherent structure of DEL, constrained by the limited diversity of building blocks, impacts the performance of compound encoders. Moreover, existing methods only capture compound features at a single level, further limiting the effectiveness of the denoising strategy. To mitigate these issues, we propose a Multimodal Pretraining DEL-Fusion model (MPDF) that enhances encoder capabilities through pretraining and integrates compound features across various scales. We develop pretraining tasks applying contrastive objectives between different compound representations and their text descriptions, enhancing the compound encoders' ability to acquire generic features. Furthermore, we propose a novel DEL-fusion framework that amalgamates compound information at the atomic, submolecular, and molecular levels, as captured by various compound encoders. The synergy of these innovations equips MPDF with enriched, multi-scale features, enabling comprehensive downstream denoising. Evaluated on three DEL datasets, MPDF demonstrates superior performance in data processing and analysis for validation tasks. Notably, MPDF offers novel insights into identifying high-affinity molecules, paving the way for improved DEL utility in drug discovery.
"Unlocking Potential Binders: Multimodal Pretraining DEL-Fusion for Denoising DNA-Encoded Libraries" — Chunbin Gu, Mutian He, Hanqun Cao, Guangyong Chen, Chang-yu Hsieh, Pheng Ann Heng. arXiv:2409.05916 (arXiv - QuanBio - Quantitative Methods), 2024-09-07.
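The contrastive pretraining objective described above — matching each compound representation to its own text description against the other descriptions in a batch — can be sketched with an InfoNCE-style loss. This is a toy pure-Python illustration with made-up 2-D embeddings, not the MPDF implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(compound_embs, text_embs, tau=0.1):
    """Mean cross-entropy of matching each compound embedding to its
    own text embedding (same index) against all texts in the batch.
    tau is the softmax temperature."""
    losses = []
    for i, c in enumerate(compound_embs):
        sims = [cosine(c, t) / tau for t in text_embs]
        m = max(sims)  # stabilise the log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        losses.append(log_z - sims[i])
    return sum(losses) / len(losses)
```

When compound and text embeddings line up index-by-index the loss is near zero; when the pairings are scrambled it grows, which is what drives the encoders toward shared, generic features.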
Yizhen Zheng, Huan Yee Koh, Maddie Yang, Li Li, Lauren T. May, Geoffrey I. Webb, Shirui Pan, George Church
The integration of Large Language Models (LLMs) into the drug discovery and development field marks a significant paradigm shift, offering novel methodologies for understanding disease mechanisms, facilitating drug discovery, and optimizing clinical trial processes. This review highlights the expanding role of LLMs in revolutionizing various stages of the drug development pipeline. We investigate how these advanced computational models can uncover target-disease linkage, interpret complex biomedical data, enhance drug molecule design, predict drug efficacy and safety profiles, and facilitate clinical trial processes. Our paper aims to provide a comprehensive overview for researchers and practitioners in computational biology, pharmacology, and AI4Science by offering insights into the potential transformative impact of LLMs on drug discovery and development.
"Large Language Models in Drug Discovery and Development: From Disease Mechanisms to Clinical Trials" — arXiv:2409.04481 (arXiv - QuanBio - Quantitative Methods), 2024-09-06.
Different loading modes can significantly affect human gait, posture, and lower limb biomechanics. This study investigated the activity intensity of the soleus muscle in young healthy adult males walking on slopes under unilateral loading. Ten subjects held dumbbells equal to 5% and 10% of their body weight (BW) and walked at a fixed speed on slopes of 5 degrees and 10 degrees, respectively. Changes in the electromyography (EMG) signals of the bilateral soleus muscles were recorded. One-way analysis of variance (ANOVA) and multivariate analysis of variance (MANOVA) were used to examine the relationship between load weight, slope angle, and muscle activity intensity. The data provided by this research can help advance the field of lower-limb assistive exoskeletons. The results fill a gap in the data on unilateral loading on slopes, provide data support for future assistance systems, and promote the formation of relevant datasets, improving both terrain recognition and the mobility of device wearers.
"The Effects of Unilateral Slope Loading on Lower Limb Plantar Flexor Muscle EMG Signals in Young Healthy Males" — Xinyu Zhou, Gengshang Dong, Pengxuan Zhang. arXiv:2409.04321 (arXiv - QuanBio - Quantitative Methods), 2024-09-06.
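A one-way ANOVA of the kind used here reduces to the ratio of between-group variance to within-group variance. A minimal sketch with toy numbers, not the study's EMG data:

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of sample groups
    (e.g. EMG intensity under different load/slope conditions):
    mean square between groups over mean square within groups."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total observations
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F indicates that the condition means differ by more than the within-condition scatter would explain; the p-value then comes from the F distribution with (k-1, n-k) degrees of freedom.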
Alberto Cattaneo, Stephen Bonner, Thomas Martynec, Carlo Luschi, Ian P Barrett, Daniel Justus
Knowledge Graph Completion has been increasingly adopted as a useful method for several tasks in biomedical research, like drug repurposing or drug-target identification. To that end, a variety of datasets and Knowledge Graph Embedding models have been proposed over the years. However, little is known about the properties that render a dataset useful for a given task and, even though theoretical properties of Knowledge Graph Embedding models are well understood, their practical utility in this field remains controversial. We conduct a comprehensive investigation into the topological properties of publicly available biomedical Knowledge Graphs and establish links to the accuracy observed in real-world applications. By releasing all model predictions and a new suite of analysis tools, we invite the community to build upon our work and continue improving the understanding of these crucial applications.
"The Role of Graph Topology in the Performance of Biomedical Knowledge Graph Completion Models" — arXiv:2409.04103 (arXiv - QuanBio - Quantitative Methods), 2024-09-06.
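Basic topological properties of a knowledge graph given as (head, relation, tail) triples — entity count, degree, density — can be summarised in a few lines. A sketch over a toy graph (the paper's released analysis suite is separate and more extensive):

```python
from collections import Counter

def degree_stats(triples):
    """Summarise a knowledge graph of (head, relation, tail) triples:
    number of entities, number of edges, maximum entity degree, and
    graph density (edges over possible directed entity pairs)."""
    deg = Counter()
    entities = set()
    for h, _, t in triples:
        deg[h] += 1
        deg[t] += 1
        entities.update((h, t))
    n = len(entities)
    density = len(triples) / (n * (n - 1)) if n > 1 else 0.0
    return {"entities": n, "edges": len(triples),
            "max_degree": max(deg.values()), "density": density}
```

Statistics like these (degree skew, density, relation cardinalities) are exactly the kind of dataset properties the paper relates to downstream completion accuracy.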
Host-pathogen interactions consist of an attack by the pathogen, frequently a defense by the host and possibly a counter-defense by the pathogen. Here, we present a game-theoretical approach to describing such interactions. We consider a game where the host and pathogen are players and they can choose between the strategies of defense (or counter-defense) and no response. Specifically, they may or may not produce a toxin and an enzyme degrading the toxin, respectively. We consider that the host and pathogen must also incur a cost for toxin or enzyme production. We highlight both the sequential and non-sequential versions of the game and determine the Nash equilibria. Further, we resolve a paradox occurring in that interplay. If the inactivating enzyme is very efficient, producing the toxin becomes useless, so the enzyme is no longer required; once the enzyme is absent, producing the toxin as a defense becomes useful again. In game theory, such situations can be described by a generalized matching pennies game. As a novel result, we find under which conditions the defense cycle leads to a steady state or to an oscillation. We obtain, for saturating dose-response kinetics and considering monotonic cost functions, 'partial (counter-)defense' strategies as pure Nash equilibria. This implies that producing a moderate amount of toxin and enzyme is the best choice.
"How hosts and pathogens choose the strengths of defense and counter-defense. A game-theoretical view" — Shalu Dwivedi, Ravindra Garde, Stefan Schuster. arXiv:2409.04497 (arXiv - QuanBio - Quantitative Methods), 2024-09-06.
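For a 2x2 bimatrix game such as generalized matching pennies, the interior mixed Nash equilibrium follows from making each player indifferent between their two pure strategies. A sketch with generic payoff matrices, not the paper's toxin/enzyme cost model:

```python
def mixed_nash_2x2(A, B):
    """Interior mixed Nash equilibrium of a 2x2 bimatrix game.
    A[i][j] is the row player's payoff and B[i][j] the column player's
    payoff when row plays i and column plays j. Assumes no pure-strategy
    equilibrium exists (as in matching pennies), so denominators are
    nonzero. Returns (p, q): the probability that row plays action 0,
    and that column plays action 0."""
    # p makes the column player indifferent between its two actions:
    # p*B[0][0] + (1-p)*B[1][0] == p*B[0][1] + (1-p)*B[1][1]
    p = (B[1][1] - B[1][0]) / (B[0][0] - B[1][0] - B[0][1] + B[1][1])
    # q makes the row player indifferent between its two actions:
    # q*A[0][0] + (1-q)*A[0][1] == q*A[1][0] + (1-q)*A[1][1]
    q = (A[1][1] - A[0][1]) / (A[0][0] - A[0][1] - A[1][0] + A[1][1])
    return p, q
```

In standard matching pennies both players mix 50/50; skewing the payoffs (as unequal toxin and enzyme costs would) shifts the equilibrium mixture away from one half.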
Huma Perveen (School of Mathematical and Physical Sciences, University of Sussex, Brighton, UK), Julie Weeds (School of Engineering and Informatics, University of Sussex, Brighton, UK)
Proteins are essential to numerous biological functions, with their sequences determining their roles within organisms. Traditional methods for determining protein function are time-consuming and labor-intensive. This study addresses the increasing demand for precise, effective, and automated protein sequence classification methods by employing natural language processing (NLP) techniques on a dataset comprising 75 target protein classes. We explored various machine learning and deep learning models, including K-Nearest Neighbors (KNN), Multinomial Naïve Bayes, Logistic Regression, Multi-Layer Perceptron (MLP), Decision Tree, Random Forest, XGBoost, Voting and Stacking classifiers, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and transformer models (BertForSequenceClassification, DistilBERT, and ProtBert). Experiments were conducted using amino acid ranges of 1-4 grams for machine learning models and different sequence lengths for CNN and LSTM models. The KNN algorithm performed best on tri-gram data with 70.0% accuracy and a macro F1 score of 63.0%. Among the ensembles, the Voting classifier achieved 74.0% accuracy and the best F1 score of 65.0%, while the Stacking classifier reached the best accuracy of 75.0% with an F1 score of 64.0%. ProtBert demonstrated the highest performance among the transformer models, with an accuracy of 76.0% and an F1 score of 61.0%, figures shared by all three transformer models. Advanced NLP techniques, particularly ensemble methods and transformer models, show great potential in protein classification. Our results demonstrate that ensemble methods, particularly soft Voting classifiers, achieved superior results, highlighting the importance of sufficient training data and of addressing sequence similarity across different classes.
"Protein sequence classification using natural language processing techniques" — arXiv:2409.04491 (arXiv - QuanBio - Quantitative Methods), 2024-09-06.
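The best-performing classical pipeline above — amino-acid tri-gram features fed to a KNN classifier — can be sketched in a few lines of pure Python. The sequences and class labels below are toy examples, not the paper's 75-class dataset:

```python
from collections import Counter
import math

def trigram_counts(seq):
    """Overlapping amino-acid 3-gram counts for one sequence."""
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))

def cosine_sim(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, train, k=3):
    """Majority label among the k training sequences whose tri-gram
    profiles are most cosine-similar to the query.
    train is a list of (sequence, label) pairs."""
    q = trigram_counts(query)
    ranked = sorted(train, reverse=True,
                    key=lambda sl: cosine_sim(q, trigram_counts(sl[0])))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The same count vectors also feed the Naïve Bayes and logistic regression baselines; only the classifier on top changes.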
Background. Clinical data warehouses (CDWs) are essential to the reuse of hospital data in observational studies and predictive modeling. However, state-of-the-art CDW systems present two drawbacks. First, they do not support the management of large data files, which is critical in medical genomics, radiology, digital pathology, and other domains where such files are generated. Second, they do not provide provenance management or means to represent longitudinal relationships between patient events. Indeed, a disease diagnosis and its follow-up rely on multiple analyses. In these cases, no relationship between the data (e.g., a large file) and its associated analysis and decision can be documented. Method. We introduce gitOmmix, an approach that overcomes these limitations, and illustrate its usefulness in the management of medical omics data. gitOmmix relies on (i) a file versioning system: git, (ii) an extension that handles large files: git-annex, (iii) a provenance knowledge graph: PROV-O, and (iv) an alignment between the git versioning information and the provenance knowledge graph. Results. Capabilities inherited from git and git-annex enable retracing the history of a clinical interpretation back to the patient sample, through supporting data and analyses. In addition, the provenance knowledge graph, aligned with the git versioning information, enables querying and browsing provenance relationships between these elements. Conclusion. gitOmmix adds a provenance layer to CDWs while scaling to large files and remaining agnostic of the CDW system. For these reasons, we think it is a viable and generalizable solution for clinical omics studies.
"Enhancing Clinical Data Warehouses with Provenance and Large File Management: The gitOmmix Approach for Clinical Omics Data" — Maxime Wack, Adrien Coulet, Anita Burgun, Bastien Rance. arXiv:2409.03288 (arXiv - QuanBio - Quantitative Methods), 2024-09-05.
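The derivation chain described above — from patient sample through data and analyses to a clinical interpretation — can be mimicked with PROV-O-style triples. This is a stand-in sketch with hypothetical identifiers, not gitOmmix's actual git/PROV-O alignment:

```python
def provenance_triples(steps):
    """Build a PROV-O-style derivation chain. Each step is
    (entity_id, activity_id); every entity wasGeneratedBy its activity,
    and every entity after the first wasDerivedFrom its predecessor."""
    triples = []
    for i, (entity, activity) in enumerate(steps):
        triples.append((entity, "prov:wasGeneratedBy", activity))
        if i > 0:
            triples.append((entity, "prov:wasDerivedFrom", steps[i - 1][0]))
    return triples

def trace_back(triples, entity):
    """Follow wasDerivedFrom links from an interpretation back to the
    originating sample, returning the full chain."""
    links = {s: o for s, p, o in triples if p == "prov:wasDerivedFrom"}
    chain = [entity]
    while chain[-1] in links:
        chain.append(links[chain[-1]])
    return chain
```

In gitOmmix each such entity additionally maps to a git (or git-annex) object, so the same chain can be walked through commit history.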
Göksel Keskin, Olivier Duriez, Pedro Lacerda, Andrea Flack, Máté Nagy
Thermal soaring enables birds to perform cost-efficient flights during foraging or migration trips. Yet, although all soaring birds exploit vertical winds effectively, this group contains species that vary strongly in their morphologies. Aerodynamic rules dictate the costs and benefits of flight, but, depending on their ecological needs, species may use different behavioural strategies. To quantify these morphology-related differences in behavioural cross-country strategies, we compiled and analysed a large dataset comprising over a hundred individuals from 12 soaring species recorded with high-frequency tracking devices. We quantified performance during thermalling and gliding flights, and the overall cross-country behaviour that combines both. Our results confirmed aerodynamic theory across the 12 species: species with higher wing loading typically flew faster, and consequently turned on a larger radius, than lighter ones. Furthermore, the combination of circling radius and minimum sink speed determines the maximum benefit soaring birds can obtain from thermals. We also observed a spectrum of strategies in how birds adapt to thermal strength, and uncovered a universal rule for cross-country strategies across all analysed species. Finally, our newly described behavioural rules can provide inspiration for technical applications, such as the development of autopilot systems for autonomous robotic gliders.
{"title":"Adaptive cross-country optimisation strategies in thermal soaring birds","authors":"Göksel Keskin, Olivier Duriez, Pedro Lacerda, Andrea Flack, Máté Nagy","doi":"arxiv-2409.03849","url":"https://doi.org/arxiv-2409.03849","journal":"arXiv - QuanBio - Quantitative Methods","publicationDate":"2024-09-05"}
The SIR model is a classical model characterizing the spread of infectious diseases. It describes the time-dependent changes in the sizes of the Susceptible, Infectious, and Recovered groups. By introducing space-dependent effects such as diffusion and growth on top of the SIR model, Fisher's model is in fact a more advanced and comprehensive model. However, Fisher's model is much less popular than the SIR model for simulating infectious diseases numerically because of difficulties with parameter selection, the involvement of 2-D/3-D spatial effects, the configuration of boundary conditions, etc. This paper aims to address these issues by providing numerical algorithms involving space and time finite-difference schemes and iterative methods, together with open-source Python code for solving Fisher's model. This 2-D Fisher solver is second order in space and up to second order in time, which is rigorously verified using test cases with analytical solutions. Numerical algorithms such as SOR, implicit Euler, staggered Crank-Nicolson, and ADI are combined to improve the efficiency and accuracy of the solver. It can handle various boundary conditions arising from different physical descriptions. In addition, real-world COVID-19 data are used with the model to demonstrate its practical use in providing predictions and inferences.
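For intuition about the model class being solved, here is a minimal 1-D explicit finite-difference sketch of the Fisher-KPP equation u_t = D u_xx + r u (1 - u) with no-flux boundaries. This is not the authors' 2-D solver (which uses implicit schemes such as Crank-Nicolson and ADI); it is a deliberately simple forward-Euler illustration, and all parameter values are illustrative:

```python
import numpy as np

def fisher_kpp_1d(D=1.0, r=1.0, L=50.0, nx=501, dt=0.004, steps=5000):
    """Explicit finite-difference solve of u_t = D*u_xx + r*u*(1 - u)
    on [0, L] with homogeneous Neumann (no-flux) boundary conditions."""
    dx = L / (nx - 1)
    # Forward Euler in time is only stable if D*dt/dx^2 <= 1/2.
    assert D * dt / dx**2 <= 0.5, "explicit scheme stability limit violated"
    x = np.linspace(0.0, L, nx)
    u = np.where(x < 2.0, 1.0, 0.0)  # infection seeded at the left edge
    for _ in range(steps):
        lap = np.empty_like(u)
        lap[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / dx**2
        lap[0] = 2.0 * (u[1] - u[0]) / dx**2      # ghost-point no-flux at x=0
        lap[-1] = 2.0 * (u[-2] - u[-1]) / dx**2   # ghost-point no-flux at x=L
        u = u + dt * (D * lap + r * u * (1.0 - u))
    return x, u

x, u = fisher_kpp_1d()
# The solution is a travelling front: saturated (u ~ 1) behind, ~0 ahead,
# advancing at roughly the classical speed 2*sqrt(D*r).
```

Implicit schemes like Crank-Nicolson and ADI, as used in the paper, remove the dt <= dx²/(2D) stability restriction that forces the small time step in this sketch, which is precisely why they matter for efficient 2-D simulations.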
{"title":"Temporal and Spacial Studies of Infectious Diseases: Mathematical Models and Numerical Solvers","authors":"Md Abu Talha, Yongjia Xu, Shan Zhao, Weihua Geng","doi":"arxiv-2409.10556","url":"https://doi.org/arxiv-2409.10556","journal":"arXiv - QuanBio - Quantitative Methods","publicationDate":"2024-09-05"}