Patterns: Latest Articles

Generating and leveraging explanations of AI/ML models in materials and manufacturing research.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-08-11 eCollection Date: 2025-09-12 DOI: 10.1016/j.patter.2025.101340
Erick J Braham, Jennifer M Ruddock, James O Hardin

In some technical domains, machine learning (ML) tools, typically used with large datasets, must be adapted to small datasets, opaque design spaces, and expensive data generation. Specifically, generating data in many materials or manufacturing contexts can be expensive in time, materials, and expertise. Additionally, the "thought process" of complex "black box" ML models is often obscure to key stakeholders. This limitation can result in inefficient or dangerous predictions when errors in data processing or model training go unnoticed. Methods of generating human-interpretable explanations of complex models, called explainable artificial intelligence (XAI), can provide the insight needed to prevent these problems. In this review, we briefly present XAI methods and outline how XAI can also inform future behavior. These examples illustrate how XAI can improve manufacturing output, physical understanding, and feature engineering. We present guidance on using XAI in materials science and manufacturing research with the aid of demonstrative examples from literature.

Patterns, vol. 6, no. 9, article 101340. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12485511/pdf/
Citations: 0
Cite what you read, read what you cite.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-08-08 DOI: 10.1016/j.patter.2025.101344
Andrew L Hufton
Patterns, vol. 6, no. 8, article 101344. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365504/pdf/
Citations: 0
Uncovering drivers of climate research in policy with pretrained language models.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-08-07 eCollection Date: 2025-11-14 DOI: 10.1016/j.patter.2025.101342
Basil Mahfouz, Licia Capra, Geoff Mulgan

Evidence-based policymaking is crucial for addressing societal challenges, yet factors driving research uptake in policy remain unclear. Previous studies have not accounted for the confounding effect of policy relevance, potentially skewing conclusions about impact drivers. Using climate change as a case study, we employ pretrained language models to identify semantically similar research paper pairs where one is cited in policy and the other is not, controlling for inherent policy relevance. This approach allows us to isolate the effects of various factors on policy citation likelihood. We find that in climate change, academic citations are the strongest predictor of policy impact, followed by media mentions. This computational method can be extended to other variables as well as different scientific domains to enable comparative analysis of policy uptake mechanisms across fields.
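
The pairing step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy three-dimensional vectors stand in for embeddings from a pretrained language model, and the function names, the greedy matching rule, and the `min_sim` threshold are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_pairs(cited, uncited, min_sim=0.8):
    """Greedily pair each policy-cited paper with its most similar unused
    uncited paper, keeping only pairs at or above min_sim (hypothetical rule)."""
    pairs = []
    used = set()
    for cid, cvec in cited.items():
        best_id, best_sim = None, min_sim
        for uid, uvec in uncited.items():
            if uid in used:
                continue
            sim = cosine(cvec, uvec)
            if sim >= best_sim:
                best_id, best_sim = uid, sim
        if best_id is not None:
            used.add(best_id)
            pairs.append((cid, best_id, best_sim))
    return pairs


# Toy embeddings (assumed); real ones would come from a language model.
cited = {"A": [1.0, 0.0, 0.2], "B": [0.0, 1.0, 0.1]}
uncited = {"X": [0.9, 0.1, 0.2], "Y": [0.1, 0.9, 0.0], "Z": [0.5, 0.5, 0.5]}

pairs = match_pairs(cited, uncited)
for cid, uid, sim in pairs:
    print(f"{cid} <-> {uid} (similarity {sim:.3f})")
```

Within each matched pair the topic is held roughly constant, so differences in other variables (academic citations, media mentions) can be compared against the policy-citation outcome.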

Patterns, vol. 6, no. 11, article 101342. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664962/pdf/
Citations: 0
Combined statistical-biophysical modeling links ion channel genes to physiology of cortical neuron types.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-08-05 eCollection Date: 2025-10-10 DOI: 10.1016/j.patter.2025.101323
Yves Bernaerts, Michael Deistler, Pedro J Gonçalves, Jonas Beck, Marcel Stimberg, Federico Scala, Andreas S Tolias, Jakob H Macke, Dmitry Kobak, Philipp Berens

Neurons have classically been characterized by their anatomy, electrophysiology, and molecular markers. More recently, single-cell transcriptomics has enabled an increasingly fine genetically defined taxonomy of cortical cell types, but the link between the gene expression of individual cell types and their physiological and anatomical properties remains poorly understood. Here, we develop a hybrid modeling approach to bridge this gap: our approach combines statistical and mechanistic models to predict cells' electrophysiological activity from gene expression patterns. To this end, we fit Hodgkin-Huxley-based models for a wide variety of cortical cell types by using simulation-based inference while overcoming the mismatch between model and data. Using multimodal Patch-seq data, we link the estimated model parameters to gene expression using an interpretable linear sparse regression model. Our approach identifies the expression of specific ion channel genes as predictive of biophysical model parameters including ion channel densities, implicating their mechanistic role in determining neural firing properties.

Patterns, vol. 6, no. 10, article 101323. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12546760/pdf/
Citations: 0
Lessons from complex systems science for AI governance.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-08-01 eCollection Date: 2025-08-08 DOI: 10.1016/j.patter.2025.101341
Noam Kolt, Michal Shur-Ofry, Reuven Cohen

The study of complex adaptive systems, pioneered in physics, biology, and the social sciences, offers important lessons for artificial intelligence (AI) governance. Contemporary AI systems and the environments in which they operate exhibit many of the properties characteristic of complex systems, including nonlinear growth patterns, emergent phenomena, and cascading effects that can lead to catastrophic failures. Complex systems science can help illuminate the features of AI that pose central challenges for policymakers, such as feedback loops induced by training AI models on synthetic data and the interconnectedness between AI systems and critical infrastructure. Drawing on insights from other domains shaped by complex systems, including public health and climate change, we examine how efforts to govern AI are marked by deep uncertainty. To contend with this challenge, we propose three desiderata for designing a set of complexity-compatible AI governance principles comprised of early and scalable intervention, adaptive institutional design, and risk thresholds calibrated to trigger timely and effective regulatory responses.

Patterns, vol. 6, no. 8, article 101341. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365527/pdf/
Citations: 0
BioLLM: A standardized framework for integrating and benchmarking single-cell foundation models.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-07-30 eCollection Date: 2025-08-08 DOI: 10.1016/j.patter.2025.101326
Ping Qiu, Qianqian Chen, Hua Qin, Shuangsang Fang, Yilin Zhang, Yanlin Zhang, Tianyi Xia, Lei Cao, Yong Zhang, Xiaodong Fang, Yuxiang Li, Luni Hu

The application and evaluation of single-cell foundation models (scFMs) present significant challenges due to heterogeneous architectures and coding standards. To address this, we introduce BioLLM (biological large language model), a unified framework for integrating and applying scFMs to single-cell RNA sequencing analysis. BioLLM provides a unified interface that integrates diverse scFMs, eliminating architectural and coding inconsistencies to enable streamlined model access. With standardized APIs and comprehensive documentation, BioLLM supports streamlined model switching and consistent benchmarking. Our comprehensive evaluation of scFMs revealed distinct strengths and limitations, highlighting scGPT's robust performance across all tasks, including zero shot and fine-tuning. Geneformer and scFoundation demonstrated strong capabilities in gene-level tasks, benefiting from effective pretraining strategies. In contrast, scBERT lagged behind, likely due to its smaller model size and limited training data. Ultimately, BioLLM aims to empower the scientific community to leverage the full potential of foundational models, advancing our understanding of complex biological systems through enhanced single-cell analysis.

Patterns, vol. 6, no. 8, article 101326. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365531/pdf/
Citations: 0
A one-shot, lossless algorithm for cross-cohort learning in mixed-outcomes analysis.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-07-30 eCollection Date: 2025-09-12 DOI: 10.1016/j.patter.2025.101321
Ruowang Li, Luke Benz, Rui Duan, Joshua C Denny, Hakon Hakonarson, Jonathan D Mosley, Jordan W Smoller, Wei-Qi Wei, Thomas Lumley, Marylyn D Ritchie, Jason H Moore, Yong Chen

In cross-cohort studies, integrating diverse datasets is essential and challenging due to cohort-specific variations, distributed data storage, and privacy concerns. Traditional methods often require data pooling or harmonization, which can reduce efficiency and limit the scope of cross-cohort learning. We introduce mixWAS, a one-shot, lossless algorithm that efficiently integrates distributed electronic health record (EHR) datasets via summary statistics. Unlike existing approaches, mixWAS preserves cohort-specific covariate associations and supports simultaneous mixed-outcome analyses. Simulations demonstrate that mixWAS outperforms conventional methods in accuracy and efficiency across various scenarios. Applied to EHR data from seven cohorts in the US, mixWAS identified 4,530 significant cross-cohort genetic associations among traits such as blood lipids, BMI, and circulatory diseases. Validation with an independent UK EHR dataset confirmed 97.7% of these associations, underscoring the algorithm's robustness. By enabling lossless cross-cohort integration, mixWAS improves the precision of multi-outcome analyses and expands the potential for actionable insights in healthcare research.
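
The general idea of lossless integration via summary statistics can be sketched for the simplest case. This is not the mixWAS algorithm: it assumes a single continuous outcome and one covariate shared across cohorts, and it fits one pooled model rather than preserving cohort-specific covariate associations. All names and data below are hypothetical. Each site shares only five sums, and the combined fit matches the fit on the pooled individual-level data up to floating-point rounding.

```python
def summarize(xs, ys):
    """Per-cohort sufficient statistics for y = a + b*x.
    Only these aggregates would leave the site, not individual records."""
    return {
        "n": len(xs),
        "sx": sum(xs),
        "sy": sum(ys),
        "sxx": sum(x * x for x in xs),
        "sxy": sum(x * y for x, y in zip(xs, ys)),
    }


def combine(summaries):
    """Add the statistics across cohorts in a single round (one shot)."""
    return {key: sum(s[key] for s in summaries) for key in summaries[0]}


def fit(s):
    """Ordinary least squares intercept and slope from the statistics."""
    denom = s["n"] * s["sxx"] - s["sx"] ** 2
    slope = (s["n"] * s["sxy"] - s["sx"] * s["sy"]) / denom
    intercept = (s["sy"] - slope * s["sx"]) / s["n"]
    return intercept, slope


# Toy cohorts (assumed data): (covariate values, outcome values).
cohort_a = ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2])
cohort_b = ([4.0, 5.0], [8.1, 9.8])

# Fit from summaries only, versus fit on the pooled raw data.
federated = fit(combine([summarize(*cohort_a), summarize(*cohort_b)]))
pooled = fit(summarize(cohort_a[0] + cohort_b[0], cohort_a[1] + cohort_b[1]))
print(federated, pooled)
```

The equivalence holds here because the least-squares solution depends on the data only through these sums; mixed outcome types, as handled in the paper, need more than this sketch shows.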

Patterns, vol. 6, no. 9, article 101321. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12485519/pdf/
Citations: 0
A consensus privacy metrics framework for synthetic data.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-07-29 eCollection Date: 2025-10-10 DOI: 10.1016/j.patter.2025.101320
Lisa Pilgram, Fida Kamal Dankar, Jörg Drechsler, Mark Elliot, Josep Domingo-Ferrer, Paul Francis, Murat Kantarcioglu, Linglong Kong, Bradley Malin, Krishnamurty Muralidhar, Puja Myles, Fabian Prasser, Jean Louis Raisaro, Chao Yan, Khaled El Emam

Synthetic data generation is a promising approach for sharing data for secondary purposes in sensitive sectors. However, to meet ethical standards and legislative requirements, it is necessary to demonstrate that the privacy of the individuals upon which the synthetic records are based is adequately protected. Through an expert consensus process, we developed a framework for privacy evaluation in synthetic data. The most commonly used metrics measure similarity between real and synthetic data and are assumed to capture identity disclosure. Our findings indicate that they lack precise interpretation and should be avoided. There was consensus on the importance of membership and attribute disclosure, both of which involve inferring personal information. The framework provides recommendations to effectively measure these types of disclosures, which also apply to differentially private synthetic data if the privacy budget is not close to zero. We further present future research opportunities to support widespread adoption of synthetic data.

Patterns, vol. 6, no. 10, article 101320. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12546437/pdf/
Citations: 0
Tucano: Advancing neural text generation for Portuguese.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-07-23 eCollection Date: 2025-11-14 DOI: 10.1016/j.patter.2025.101325
Nicholas Kluge Corrêa, Aniket Sen, Sophia Falk, Shiza Fatimah

Natural language processing has seen substantial progress in recent years. However, current deep-learning-based language models demand extensive data and computational resources. This data-intensive paradigm has led to a divide between high-resource languages, where development is thriving, and low-resource languages, which lag behind. To address this disparity, this study introduces a new set of resources to advance neural text generation for Portuguese. Here, we document the development of GigaVerbo, a Portuguese text corpus amounting to 200 billion tokens. Using this corpus, we trained Tucano, a family of decoder-only transformer models. Our models consistently outperform comparable Portuguese and multilingual models on several benchmarks. All models, datasets, and tools developed in this work are openly available to the community to support reproducible research.

Patterns, vol. 6, no. 11, article 101325. PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664968/pdf/
Citations: 0
A systematic survey of natural language processing for the Greek language.
IF 7.4 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date : 2025-07-21 eCollection Date: 2025-11-14 DOI: 10.1016/j.patter.2025.101313
Juli Bakagianni, Kanella Pouli, Maria Gavriilidou, John Pavlopoulos

Comprehensive monolingual natural language processing (NLP) surveys are essential for assessing language-specific challenges, resource availability, and research gaps. However, existing surveys often lack standardized methodologies, leading to selection bias and fragmented coverage of NLP tasks and resources. This study introduces a generalizable framework for systematic monolingual NLP surveys. Our approach integrates a structured search protocol to minimize bias, an NLP task taxonomy for classification, and language resource taxonomies to identify potential benchmarks and highlight opportunities for improving resource availability. We apply this framework to Greek NLP (2012-2023), providing an in-depth analysis of its current state, task-specific progress, and resource gaps. The survey results are publicly available and are regularly updated to provide an evergreen resource. This systematic survey of Greek NLP serves as a case study, demonstrating the effectiveness of our framework and its potential for broader application to other not-so-well-resourced languages as regards NLP.

Patterns, vol. 6, no. 11, article 101313. Published 2025-07-21. Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12715428/pdf/
Citations: 0