PLoS Computational Biology: Latest Articles

Calibrating dimension reduction hyperparameters in the presence of noise
IF 4.3; Zone 2 (Biology); Q1 BIOCHEMICAL RESEARCH METHODS; Pub Date: 2024-09-12; DOI: 10.1371/journal.pcbi.1012427
Justin Lin, Julia Fukuyama
The goal of dimension reduction tools is to construct a low-dimensional representation of high-dimensional data. These tools are employed for a variety of reasons, such as reducing noise, visualizing data, and lowering computational costs. However, a fundamental issue that is discussed in other modeling problems is often overlooked in dimension reduction: overfitting. In other modeling problems, techniques such as feature selection, cross-validation, and regularization are employed to combat overfitting, but such precautions are rarely taken when applying dimension reduction. Prior applications of the two most popular non-linear dimension reduction methods, t-SNE and UMAP, fail to acknowledge data as a combination of signal and noise when assessing performance. These methods are typically calibrated to capture the entirety of the data, not just the signal. In this paper, we demonstrate the importance of acknowledging noise when calibrating hyperparameters and present a framework that enables users to do so. We use this framework to explore the role hyperparameter calibration plays in overfitting the data when applying t-SNE and UMAP. More specifically, we show that previously recommended values for perplexity and n_neighbors are too small and overfit the noise. We also provide a workflow others may use to calibrate hyperparameters in the presence of noise.
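The idea of scoring an embedding against the signal rather than against the full noisy data can be illustrated with a small sketch. This is not the authors' framework: the synthetic data, the helper names `make_signal_plus_noise` and `calibrate_perplexity`, and the use of scikit-learn's `trustworthiness` as the agreement score are all assumptions made here for illustration. The sketch embeds the noisy data with t-SNE at several perplexity values and keeps the value whose embedding best preserves neighborhoods of the known signal.

```python
import numpy as np
from sklearn.manifold import TSNE, trustworthiness


def make_signal_plus_noise(n=120, noise_dims=20, noise_sd=0.5, seed=0):
    """Hypothetical toy data: three 2-D clusters (signal) plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
    labels = rng.integers(0, 3, size=n)
    signal = centers[labels] + rng.normal(scale=1.0, size=(n, 2))
    # Pad the signal with zero-valued dimensions, then add noise everywhere.
    padded = np.hstack([signal, np.zeros((n, noise_dims))])
    observed = padded + rng.normal(scale=noise_sd, size=padded.shape)
    return signal, observed


def calibrate_perplexity(signal, observed, perplexities, n_neighbors=10, seed=0):
    """Embed the noisy data, but score each embedding against the signal only."""
    scores = {}
    for p in perplexities:
        emb = TSNE(
            n_components=2, perplexity=p, init="pca", random_state=seed
        ).fit_transform(observed)
        # Assumed score: how well signal neighborhoods survive in the embedding.
        scores[p] = trustworthiness(signal, emb, n_neighbors=n_neighbors)
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    signal, observed = make_signal_plus_noise()
    best, scores = calibrate_perplexity(signal, observed, [5, 15, 30])
    print("scores by perplexity:", scores)
    print("selected perplexity:", best)
```

On real data the signal is not known in advance, so some estimate of it (for example, a low-rank approximation) would have to stand in for `signal`; that choice is left open here.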
Citations: 0
Shedding light on blue-green photosynthesis: A wavelength-dependent mathematical model of photosynthesis in Synechocystis sp. PCC 6803
IF 4.3; Zone 2 (Biology); Q1 BIOCHEMICAL RESEARCH METHODS; Pub Date: 2024-09-12; DOI: 10.1371/journal.pcbi.1012445
Tobias Pfennig, Elena Kullmann, Tomáš Zavřel, Andreas Nakielski, Oliver Ebenhöh, Jan Červený, Gábor Bernát, Anna Barbara Matuszyńska
Cyanobacteria hold great potential to revolutionize conventional industries and farming practices with their light-driven chemical production. To fully exploit their photosynthetic capacity and enhance product yield, it is crucial to investigate their intricate interplay with the environment, including light intensity and spectrum. Mathematical models provide valuable insights for optimizing strategies in this pursuit. In this study, we present an ordinary differential equation-based model for the cyanobacterium Synechocystis sp. PCC 6803 to assess its performance under various light sources, including monochromatic light. Our model can reproduce a variety of physiologically measured quantities, e.g. the experimentally reported partitioning of electrons through four main pathways, O2 evolution, and the rate of carbon fixation for ambient and saturated CO2. By capturing the interactions between different components of a photosynthetic system, our model helps in understanding the underlying mechanisms driving system behavior. Our model qualitatively reproduces fluorescence emitted under various light regimes, replicating pulse-amplitude modulation (PAM) fluorometry experiments with saturating pulses. Using our model, we test four hypothesized mechanisms of cyanobacterial state transitions for an ensemble of parameter sets and find no physiological benefit of a model assuming phycobilisome detachment. Moreover, we evaluate metabolic control for biotechnological production under diverse light colors and irradiances. We suggest gene targets for overexpression under different illuminations to increase the yield. By offering a comprehensive computational model of cyanobacterial photosynthesis, our work enhances the basic understanding of light-dependent cyanobacterial behavior and sets the first wavelength-dependent framework to systematically test their production capacity for biocatalysis.
Citations: 0
An exploration into CTEPH medications: Combining natural language processing, embedding learning, in vitro models, and real-world evidence for drug repurposing
IF 4.3; Zone 2 (Biology); Q1 BIOCHEMICAL RESEARCH METHODS; Pub Date: 2024-09-12; DOI: 10.1371/journal.pcbi.1012417
Daniel Steiert, Corey Wittig, Priyanka Banerjee, Robert Preissner, Robert Szulcek
Background: In the modern era, the growth of scientific literature presents a daunting challenge for researchers to keep informed of advancements across multiple disciplines.
Objective: We apply natural language processing (NLP) and embedding learning concepts to design PubDigest, a tool that combs PubMed literature, aiming to pinpoint potential drugs that could be repurposed.
Methods: Using NLP, especially term associations through word embeddings, we explored unrecognized relationships between drugs and diseases. To illustrate the utility of PubDigest, we focused on chronic thromboembolic pulmonary hypertension (CTEPH), a rare disease with an overall limited number of scientific publications.
Results: Our literature analysis identified key clinical features linked to CTEPH by applying term frequency-inverse document frequency (TF-IDF) scoring, a technique measuring a term’s significance in a text corpus. This allowed us to map related diseases. One standout was venous thrombosis (VT), which showed strong semantic links with CTEPH. Looking deeper, we discovered potential repurposing candidates for CTEPH through large-scale neural network-based contextualization of literature and predictive modeling on both the CTEPH and the VT literature corpora to find novel, yet unrecognized associations between the two diseases. Alongside the anti-thrombotic agent caplacizumab, benzofuran derivatives were an intriguing find. In particular, the benzofuran derivative amiodarone displayed potential anti-thrombotic properties in the literature. Our in vitro tests confirmed amiodarone’s ability to reduce platelet aggregation significantly by 68% (p = 0.02). However, real-world clinical data indicated that CTEPH patients receiving amiodarone treatment faced a significant 15.9% higher mortality risk (p<0.001).
Conclusions: While NLP offers an innovative approach to interpreting scientific literature, especially for drug repurposing, it is crucial to combine it with complementary methods like in vitro testing and real-world evidence. Our exploration with benzofuran derivatives and CTEPH underscores this point. Thus, blending NLP with hands-on experiments and real-world clinical data can pave the way for faster and safer drug repurposing approaches, especially for rare diseases like CTEPH.
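The TF-IDF scoring mentioned in the abstract can be made concrete with a short sketch. The code below uses one common textbook variant, tf(t, d) · log(N / df(t)); the abstract does not say which variant PubDigest uses, and the function name `tf_idf` and the three-document toy corpus are hypothetical. It shows how a term that appears in every document scores zero, while a term confined to one document scores highest.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


if __name__ == "__main__":
    # Hypothetical toy corpus, not taken from the paper.
    corpus = [
        "cteph pulmonary hypertension thrombosis",
        "venous thrombosis anticoagulant therapy",
        "pulmonary embolism thrombosis risk",
    ]
    for doc, doc_scores in zip(corpus, tf_idf(corpus)):
        ranked = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)
        print(doc, "->", ranked[:2])
```

Library implementations often smooth the inverse document frequency so that ubiquitous terms keep a small non-zero weight; the unsmoothed form is used here only because it is the easiest to check by hand.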
Citations: 0
Comparison and benchmark of deep learning methods for non-coding RNA classification
IF 4.3; Zone 2 (Biology); Q1 BIOCHEMICAL RESEARCH METHODS; Pub Date: 2024-09-12; DOI: 10.1371/journal.pcbi.1012446
Constance Creux, Farida Zehraoui, François Radvanyi, Fariza Tahi
The involvement of non-coding RNAs in biological processes and diseases has made the exploration of their functions crucial. Most non-coding RNAs have yet to be studied, creating the need for methods that can rapidly classify large sets of non-coding RNAs into functional groups, or classes. In recent years, the success of deep learning in various domains has led to its application to non-coding RNA classification. Multiple novel architectures have been developed, but these advancements are not covered by current literature reviews. We present an exhaustive comparison of the different state-of-the-art methods and describe their associated datasets. Moreover, the literature lacks objective benchmarks. We perform experiments to fairly evaluate the performance of various tools for non-coding RNA classification on popular datasets. We explore the robustness of methods to non-functional sequences and sequence boundary noise. We also measure computation time and CO2 emissions. In light of these results, we assess the relevance of the different architectural choices and provide recommendations to consider in future methods.
Citations: 0
Design and implementation of an asynchronous online course-based undergraduate research experience (CURE) in computational genomics
IF 4.3; Zone 2 (Biology); Q1 BIOCHEMICAL RESEARCH METHODS; Pub Date: 2024-09-12; DOI: 10.1371/journal.pcbi.1012384
Seema B. Plaisier, Danielle O. Alarid, Joelle A. Denning, Sara E. Brownell, Kenneth H. Buetow, Katelyn M. Cooper, Melissa A. Wilson
As genomics technologies advance, there is a growing demand for computational biologists trained in genomics analysis, but instructors face significant hurdles in providing formal training in computer programming, statistics, and genomics to biology students. Fully online learners represent a significant and growing community that can help meet this need, but they are frequently excluded from valuable research opportunities, most of which do not offer the flexibility they need. To address these opportunity gaps, we developed an asynchronous course-based undergraduate research experience (CURE) for computational genomics specifically for fully online biology students. We generated custom learning materials and leveraged remotely accessible computational tools to address 2 novel research questions over 2 iterations of the genomics CURE, one testing bioinformatics approaches and one mining cancer genomics data. Here, we present how the instructional team distributed the analysis needed to address these questions among students over a 7.5-week CURE and provided concurrent training in biology and statistics, computer programming, and professional development. Scores from identical learning assessments administered before and after completion of each CURE showed significant learning gains across biology and coding course objectives. Open-response progress reports were submitted weekly and identified self-reported adaptive coping strategies for challenges encountered throughout the course. Progress reports identified problems that could be resolved through collaboration with instructors and peers via messaging platforms and virtual meetings. We implemented asynchronous communication using the Slack messaging platform and an asynchronous journal club where students discussed relevant publications using the Perusall social annotation platform. The online genomics CURE resulted in unanticipated positive outcomes, including students voluntarily discussing plans to continue research after the course. These outcomes underscore the effectiveness of this genomics CURE for scientific training, recruitment and student-mentor relationships, and student successes. Asynchronous genomics CUREs can contribute to a more skilled, diverse, and inclusive workforce for the advancement of biomedical science.
Citations: 0
SEMbap: Bow-free covariance search and data de-correlation
IF 4.3; Zone 2 (Biology); Q1 BIOCHEMICAL RESEARCH METHODS; Pub Date: 2024-09-11; DOI: 10.1371/journal.pcbi.1012448
Mario Grassi, Barbara Tarantino
Large-scale studies of gene expression are commonly influenced by biological and technical sources of expression variation, including batch effects, sample characteristics, and environmental impacts. Learning the causal relationships between observable variables may be challenging in the presence of unobserved confounders, and many high-dimensional regression techniques may perform worse in this setting. Controlling for unobserved confounding variables is therefore essential, and many deconfounding methods have been suggested for application in a variety of situations. The main contribution of this article is the development of a two-stage deconfounding procedure based on Bow-free Acyclic Paths (BAP) search, developed within the framework of Structural Equation Models (SEM) and called SEMbap(). In the first stage, an exhaustive search of missing edges with significant covariance is performed via Shipley d-separation tests; then, in the second stage, a Constrained Gaussian Graphical Model (CGGM) is fitted or a low-dimensional representation of the bow-free edge structure is obtained via Graph Laplacian Principal Component Analysis (gLPCA). We compare four popular deconfounding methods with the BAP search approach in applications on simulated and observed expression data. In the former, different structures of the hidden covariance matrix have been replicated. Compared to existing methods, the BAP search algorithm is able to correctly identify hidden confounding whilst controlling the false positive rate and achieving good fitting and perturbation metrics.
Citations: 0
Creating cell-specific computational models of stem cell-derived cardiomyocytes using optical experiments
IF 4.3 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-11 DOI: 10.1371/journal.pcbi.1011806
Janice Yang, Neil J. Daily, Taylor K. Pullinger, Tetsuro Wakatsuki, Eric A. Sobie
Human induced pluripotent stem cell-derived cardiomyocytes (iPSC-CMs) have gained traction as a powerful model in cardiac disease and therapeutics research, since iPSCs are self-renewing and can be derived from healthy and diseased patients without invasive surgery. However, current iPSC-CM differentiation methods produce cardiomyocytes with immature, fetal-like electrophysiological phenotypes, and the variety of maturation protocols in the literature results in phenotypic differences between labs. Heterogeneity of iPSC donor genetic backgrounds contributes to additional phenotypic variability. Several mathematical models of iPSC-CM electrophysiology have been developed to help to predict cell responses, but these models individually do not capture the phenotypic variability observed in iPSC-CMs. Here, we tackle these limitations by developing a computational pipeline to calibrate cell preparation-specific iPSC-CM electrophysiological parameters. We used the genetic algorithm (GA), a heuristic parameter calibration method, to tune ion channel parameters in a mathematical model of iPSC-CM physiology. To systematically optimize an experimental protocol that generates sufficient data for parameter calibration, we created in silico datasets by simulating various protocols applied to a population of models with known conductance variations, and then fitted parameters to those datasets. We found that calibrating to voltage and calcium transient data under 3 varied experimental conditions, including electrical pacing combined with ion channel blockade and changing buffer ion concentrations, improved model parameter estimates and model predictions of unseen channel block responses. This observation also held when the fitted data were normalized, suggesting that normalized fluorescence recordings, which are more accessible and higher throughput than patch clamp recordings, could sufficiently inform conductance parameters. 
Therefore, this computational pipeline can be applied to different iPSC-CM preparations to determine cell line-specific ion channel properties and understand the mechanisms behind variability in perturbation responses.
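The calibration idea in this abstract (create an in silico dataset from known parameters, then check whether a genetic algorithm recovers them) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the two-exponential `simulate` function is a toy stand-in for an iPSC-CM model, the parameter names `g_fast` and `g_slow` are hypothetical, and the selection, crossover, and mutation settings are generic choices, not those used by the authors.

```python
import math
import random


def simulate(g_fast, g_slow, n_steps=100):
    """Toy 'transient' (hypothetical): weighted sum of a fast and a slow decay."""
    return [g_fast * math.exp(-t / 5.0) + g_slow * math.exp(-t / 50.0)
            for t in range(n_steps)]


def sse(params, target):
    """Sum of squared errors between the simulated and the target trace."""
    trace = simulate(*params)
    return sum((a - b) ** 2 for a, b in zip(trace, target))


def calibrate(target, pop_size=60, n_parents=15, n_generations=80,
              bounds=(0.1, 3.0), sigma=0.05, seed=0):
    """Generic genetic algorithm: truncation selection, uniform crossover,
    Gaussian mutation, and elitism (parents survive unchanged)."""
    rng = random.Random(seed)
    lo, hi = bounds
    population = [(rng.uniform(lo, hi), rng.uniform(lo, hi))
                  for _ in range(pop_size)]
    for _ in range(n_generations):
        # Rank candidates by fit and keep the best ones as parents.
        population.sort(key=lambda p: sse(p, target))
        parents = population[:n_parents]
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            # Each gene comes from one parent, is perturbed, then clipped to bounds.
            child = tuple(
                min(hi, max(lo, rng.choice(pair) + rng.gauss(0.0, sigma)))
                for pair in zip(mother, father)
            )
            children.append(child)
        population = parents + children
    return min(population, key=lambda p: sse(p, target))


# In silico dataset generated from known parameters, then refitted.
TRUE_PARAMS = (1.4, 0.6)
target_trace = simulate(*TRUE_PARAMS)
fitted = calibrate(target_trace)
print("true:", TRUE_PARAMS, "fitted:", tuple(round(g, 3) for g in fitted))
```

Because the true parameters are known, the gap between `fitted` and `TRUE_PARAMS` measures how well a given protocol constrains the model, which is the kind of comparison the abstract describes across simulated protocols.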
Citations: 0
Oscillations in an artificial neural network convert competing inputs into a temporal code
IF 4.3 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-11 DOI: 10.1371/journal.pcbi.1012429
Katharina Duecker, Marco Idiart, Marcel van Gerven, Ole Jensen
The field of computer vision has long drawn inspiration from neuroscientific studies of the human and non-human primate visual system. The development of convolutional neural networks (CNNs), for example, was informed by the properties of simple and complex cells in early visual cortex. However, the computational relevance of oscillatory dynamics experimentally observed in the visual system is typically not considered in artificial neural networks (ANNs). Computational models of neocortical dynamics, on the other hand, rarely take inspiration from computer vision. Here, we combine methods from computational neuroscience and machine learning to implement multiplexing in a simple ANN using oscillatory dynamics. We first trained the network to classify individually presented letters. Post-training, we added temporal dynamics to the hidden layer, introducing refraction in the hidden units as well as pulsed inhibition mimicking neuronal alpha oscillations. Without these dynamics, the trained network correctly classified individual letters but produced a mixed output when presented with two letters simultaneously, indicating a bottleneck problem. When introducing refraction and oscillatory inhibition, the output nodes corresponding to the two stimuli activate sequentially, ordered along the phase of the inhibitory oscillations. Our model implements the idea that inhibitory oscillations segregate competing inputs in time. The results of our simulations pave the way for applications in deeper network architectures and more complicated machine learning problems.
Citations: 0
The primacy model and the structure of olfactory space
IF 4.3 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-10 DOI: 10.1371/journal.pcbi.1012379
Hamza Giaffar, Sergey Shuvaev, Dmitry Rinberg, Alexei A. Koulakov
Understanding sensory processing relies on the establishment of a consistent relationship between the stimulus space, its neural representation, and perceptual quality. In olfaction, the difficulty in establishing these links lies partly in the complexity of the underlying odor input space and perceptual responses. Based on the recently proposed primacy model for concentration invariant odor identity representation and a few assumptions, we have developed a theoretical framework for mapping the odor input space to the response properties of olfactory receptors. We analyze a geometrical structure containing odor representations in a multidimensional space of receptor affinities and describe its low-dimensional implementation, the primacy hull. We propose the implications of the primacy hull for the structure of feedforward connectivity in early olfactory networks. We test the predictions of our theory by comparing the existing receptor-ligand affinity and connectivity data obtained in the fruit fly olfactory system. We find that the Kenyon cells of the insect mushroom body integrate inputs from the high-affinity (primacy) sets of olfactory receptors in agreement with the primacy theory.
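As a rough illustration of why a code built on the high-affinity (primacy) set can be concentration invariant, the sketch below ranks receptors by affinity and keeps the top `p`. It rests on assumptions not stated in the abstract: receptor drive is taken to be simply concentration times affinity, the receptor names `R1` to `R5` and their affinity values are hypothetical, and `p = 2` is arbitrary.

```python
def receptor_drive(affinities, concentration):
    """Assumed linear regime: drive = concentration * affinity per receptor."""
    return {name: concentration * a for name, a in affinities.items()}


def primacy_set(drive, p):
    """Return the p receptors with the strongest drive (the 'primacy set')."""
    ranked = sorted(drive, key=drive.get, reverse=True)
    return frozenset(ranked[:p])


# Hypothetical affinities of five receptors for a single odor.
affinities = {"R1": 0.9, "R2": 0.2, "R3": 0.7, "R4": 0.05, "R5": 0.4}

low = primacy_set(receptor_drive(affinities, 0.1), p=2)
high = primacy_set(receptor_drive(affinities, 10.0), p=2)
print(sorted(low), sorted(high), low == high)
```

Scaling every drive by the same positive concentration leaves the ranking unchanged, so the identity of the top-`p` set is the same at both concentrations under this simplified assumption.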
Citations: 0
Construct prognostic models of multiple myeloma with pathway information incorporated
IF 4.3 Zone 2 (Biology) Q1 BIOCHEMICAL RESEARCH METHODS Pub Date: 2024-09-10 DOI: 10.1371/journal.pcbi.1012444
Shuo Wang, ShanJin Wang, Wei Pan, YuYang Yi, Junyan Lu
Multiple myeloma (MM) is a hematological disease exhibiting aberrant clonal expansion of cancerous plasma cells in the bone marrow. The effects of treatments for MM vary between patients, highlighting the importance of developing prognostic models for informed therapeutic decision-making. Most previous models were constructed at the gene level, ignoring the fact that the dysfunction of the pathway is closely associated with disease development and progression. The present study considered two strategies that construct predictive models by taking pathway information into consideration: pathway score method and group lasso using pathway information. The former simply converted gene expression to sample-wise pathway scores for model fitting. We considered three methods for pathway score calculation (ssGSEA, GSVA, and z-scores) and 14 data sources providing pathway information. We implemented these methods in microarray data for MM (GSE136324) and obtained a candidate model with the best prediction performance in internal validation. The candidate model is further compared with the gene-based model and previously published models in two external datasets. We also investigated the effects of missing values on prediction. The results showed that group lasso incorporating Vax pathway information (Vax(grp)) was more competitive in prediction than the gene model in both internal and external validation. Immune information, including VAX pathways, seemed to be more predictive for MM. Vax(grp) also outperformed the previously published models. Moreover, the new model was more resistant to missing values, and the presence of missing values (<5%) would not evidently deteriorate its prediction accuracy using our missing data imputation method. In a nutshell, pathway-based models (using group lasso) were competitive alternatives to gene-based models for MM. 
These models were documented in an R package (https://github.com/ShuoStat/MMMs), where a missing data imputation method was also integrated to facilitate future validation.
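The "pathway score" strategy, converting gene expression into one score per sample and pathway, can be sketched for the z-score variant. This follows one common definition (standardize each gene across samples, then sum the z-values of the pathway's genes and divide by the square root of the gene count); the abstract does not say which exact formula was used, and the gene names and expression values below are hypothetical. The sketch is written in Python and is unrelated to the R package mentioned above.

```python
import math
import statistics


def standardize(values):
    """Z-transform one gene's expression values across samples."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mu) / sd for v in values]


def pathway_zscores(expression, pathway_genes):
    """Sample-wise pathway score: sum of gene z-scores / sqrt(number of genes).

    expression: dict mapping gene name -> list of values, one per sample.
    pathway_genes: gene names in the pathway; genes absent from
    `expression` are skipped.
    """
    genes = [g for g in pathway_genes if g in expression]
    if not genes:
        raise ValueError("no pathway genes found in the expression data")
    z_rows = [standardize(expression[g]) for g in genes]
    n_samples = len(z_rows[0])
    return [sum(row[s] for row in z_rows) / math.sqrt(len(genes))
            for s in range(n_samples)]


# Hypothetical data: three genes measured in three samples.
expression = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [2.0, 4.0, 6.0],
    "geneC": [5.0, 5.0, 6.0],
}
scores = pathway_zscores(expression, ["geneA", "geneB", "geneX"])
print([round(s, 3) for s in scores])
```

The resulting scores would then replace gene-level features as inputs to a prognostic model, one column per pathway.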
Citations: 0