Pub Date: 2025-08-11; eCollection Date: 2025-09-12; DOI: 10.1016/j.patter.2025.101340
Erick J Braham, Jennifer M Ruddock, James O Hardin
In some technical domains, machine learning (ML) tools, typically used with large datasets, must be adapted to small datasets, opaque design spaces, and expensive data generation. Specifically, generating data in many materials or manufacturing contexts can be expensive in time, materials, and expertise. Additionally, the "thought process" of complex "black box" ML models is often obscure to key stakeholders. This limitation can result in inefficient or dangerous predictions when errors in data processing or model training go unnoticed. Methods of generating human-interpretable explanations of complex models, called explainable artificial intelligence (XAI), can provide the insight needed to prevent these problems. In this review, we briefly present XAI methods and outline how XAI can also inform future behavior. These examples illustrate how XAI can improve manufacturing output, physical understanding, and feature engineering. We present guidance on using XAI in materials science and manufacturing research with the aid of demonstrative examples from literature.
{"title":"Generating and leveraging explanations of AI/ML models in materials and manufacturing research.","authors":"Erick J Braham, Jennifer M Ruddock, James O Hardin","doi":"10.1016/j.patter.2025.101340","DOIUrl":"10.1016/j.patter.2025.101340","url":null,"abstract":"<p><p>In some technical domains, machine learning (ML) tools, typically used with large datasets, must be adapted to small datasets, opaque design spaces, and expensive data generation. Specifically, generating data in many materials or manufacturing contexts can be expensive in time, materials, and expertise. Additionally, the \"thought process\" of complex \"black box\" ML models is often obscure to key stakeholders. This limitation can result in inefficient or dangerous predictions when errors in data processing or model training go unnoticed. Methods of generating human-interpretable explanations of complex models, called explainable artificial intelligence (XAI), can provide the insight needed to prevent these problems. In this review, we briefly present XAI methods and outline how XAI can also inform future behavior. These examples illustrate how XAI can improve manufacturing output, physical understanding, and feature engineering. We present guidance on using XAI in materials science and manufacturing research with the aid of demonstrative examples from literature.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 9","pages":"101340"},"PeriodicalIF":7.4,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12485511/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145214014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
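Model-agnostic explanation methods of the kind this review surveys can be illustrated with permutation importance, one of the simplest XAI techniques: shuffle one feature's values and measure how much predictive accuracy drops. The sketch below is a minimal pure-Python illustration; the toy model, data, and function names are ours for demonstration, not taken from the paper.

```python
import random

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Model-agnostic XAI: importance of feature j = mean drop in
    accuracy when column j is randomly shuffled across samples."""
    rng = random.Random(seed)

    def accuracy(Xs):
        return sum(model(x) == t for x, t in zip(Xs, y)) / len(y)

    base = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [x[j] for x in X]
            rng.shuffle(col)
            X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, col)]
            drops.append(base - accuracy(X_perm))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy "black box": predicts 1 when the first feature exceeds a threshold.
model = lambda x: int(x[0] > 0.5)
X = [[0.1, 0.9], [0.9, 0.2], [0.8, 0.8], [0.2, 0.1]] * 5
y = [model(x) for x in X]  # labels depend only on feature 0
imp = permutation_importance(model, X, y)
```

Because the labels ignore feature 1, shuffling it never changes a prediction, so its importance is exactly zero, while feature 0 receives a positive score.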
Pub Date: 2025-08-08; DOI: 10.1016/j.patter.2025.101344
Andrew L Hufton
{"title":"Cite what you read, read what you cite.","authors":"Andrew L Hufton","doi":"10.1016/j.patter.2025.101344","DOIUrl":"10.1016/j.patter.2025.101344","url":null,"abstract":"","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 8","pages":"101344"},"PeriodicalIF":7.4,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365504/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144972412","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-07; eCollection Date: 2025-11-14; DOI: 10.1016/j.patter.2025.101342
Basil Mahfouz, Licia Capra, Geoff Mulgan
Evidence-based policymaking is crucial for addressing societal challenges, yet factors driving research uptake in policy remain unclear. Previous studies have not accounted for the confounding effect of policy relevance, potentially skewing conclusions about impact drivers. Using climate change as a case study, we employ pretrained language models to identify semantically similar research paper pairs where one is cited in policy and the other is not, controlling for inherent policy relevance. This approach allows us to isolate the effects of various factors on policy citation likelihood. We find that in climate change, academic citations are the strongest predictor of policy impact, followed by media mentions. This computational method can be extended to other variables as well as different scientific domains to enable comparative analysis of policy uptake mechanisms across fields.
{"title":"Uncovering drivers of climate research in policy with pretrained language models.","authors":"Basil Mahfouz, Licia Capra, Geoff Mulgan","doi":"10.1016/j.patter.2025.101342","DOIUrl":"10.1016/j.patter.2025.101342","url":null,"abstract":"<p><p>Evidence-based policymaking is crucial for addressing societal challenges, yet factors driving research uptake in policy remain unclear. Previous studies have not accounted for the confounding effect of policy relevance, potentially skewing conclusions about impact drivers. Using climate change as a case study, we employ pretrained language models to identify semantically similar research paper pairs where one is cited in policy and the other is not, controlling for inherent policy relevance. This approach allows us to isolate the effects of various factors on policy citation likelihood. We find that in climate change, academic citations are the strongest predictor of policy impact, followed by media mentions. This computational method can be extended to other variables as well as different scientific domains to enable comparative analysis of policy uptake mechanisms across fields.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 11","pages":"101342"},"PeriodicalIF":7.4,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664962/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145655734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
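The paper's matching strategy, pairing each policy-cited paper with a semantically similar uncited one, can be sketched with cosine similarity over precomputed text embeddings. A minimal illustration, assuming embeddings are already available as plain vectors; the paper IDs and vectors below are invented:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def match_pairs(cited, uncited):
    """For each policy-cited paper, pick the semantically closest
    uncited paper, controlling for inherent policy relevance."""
    pairs = []
    for cid, cvec in cited.items():
        best = max(uncited, key=lambda uid: cosine(cvec, uncited[uid]))
        pairs.append((cid, best, cosine(cvec, uncited[best])))
    return pairs

# Hypothetical embeddings (real ones would come from a pretrained LM).
cited = {"A": [1.0, 0.1, 0.0]}
uncited = {"X": [0.9, 0.2, 0.1], "Y": [0.0, 1.0, 0.0]}
pairs = match_pairs(cited, uncited)
```

With topic similarity held fixed within each pair, differences in metadata (citations, media mentions) can then be related to which member was cited in policy.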
Pub Date: 2025-08-05; eCollection Date: 2025-10-10; DOI: 10.1016/j.patter.2025.101323
Yves Bernaerts, Michael Deistler, Pedro J Gonçalves, Jonas Beck, Marcel Stimberg, Federico Scala, Andreas S Tolias, Jakob H Macke, Dmitry Kobak, Philipp Berens
Neurons have classically been characterized by their anatomy, electrophysiology, and molecular markers. More recently, single-cell transcriptomics has enabled an increasingly fine genetically defined taxonomy of cortical cell types, but the link between the gene expression of individual cell types and their physiological and anatomical properties remains poorly understood. Here, we develop a hybrid modeling approach to bridge this gap: our approach combines statistical and mechanistic models to predict cells' electrophysiological activity from gene expression patterns. To this end, we fit Hodgkin-Huxley-based models for a wide variety of cortical cell types by using simulation-based inference while overcoming the mismatch between model and data. Using multimodal Patch-seq data, we link the estimated model parameters to gene expression using an interpretable linear sparse regression model. Our approach identifies the expression of specific ion channel genes as predictive of biophysical model parameters including ion channel densities, implicating their mechanistic role in determining neural firing properties.
{"title":"Combined statistical-biophysical modeling links ion channel genes to physiology of cortical neuron types.","authors":"Yves Bernaerts, Michael Deistler, Pedro J Gonçalves, Jonas Beck, Marcel Stimberg, Federico Scala, Andreas S Tolias, Jakob H Macke, Dmitry Kobak, Philipp Berens","doi":"10.1016/j.patter.2025.101323","DOIUrl":"10.1016/j.patter.2025.101323","url":null,"abstract":"<p><p>Neurons have classically been characterized by their anatomy, electrophysiology, and molecular markers. More recently, single-cell transcriptomics has enabled an increasingly fine genetically defined taxonomy of cortical cell types, but the link between the gene expression of individual cell types and their physiological and anatomical properties remains poorly understood. Here, we develop a hybrid modeling approach to bridge this gap: our approach combines statistical and mechanistic models to predict cells' electrophysiological activity from gene expression patterns. To this end, we fit Hodgkin-Huxley-based models for a wide variety of cortical cell types by using simulation-based inference while overcoming the mismatch between model and data. Using multimodal Patch-seq data, we link the estimated model parameters to gene expression using an interpretable linear sparse regression model. Our approach identifies the expression of specific ion channel genes as predictive of biophysical model parameters including ion channel densities, implicating their mechanistic role in determining neural firing properties.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 10","pages":"101323"},"PeriodicalIF":7.4,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12546760/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145372432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
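The interpretable linear sparse regression step can be illustrated with a generic lasso solved by coordinate descent, which drives coefficients of uninformative predictors exactly to zero. This is a standard sketch of sparse regression under invented toy data, not the authors' exact pipeline:

```python
def soft_threshold(z, t):
    """Soft-thresholding operator, the core of lasso coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_w 1/(2n)||y - Xw||^2 + lam*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the contribution of all features but j.
            r = [y[i] - sum(w[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z
    return w

# Toy data: the outcome depends only on feature 0 (slope 2); feature 1 is noise.
X = [[1, 0.1], [2, -0.2], [3, 0.15], [4, -0.05]]
y = [2.0, 4.0, 6.0, 8.0]
w = lasso_cd(X, y, lam=0.1)
```

The L1 penalty zeroes out the noise feature while leaving the true predictor's coefficient near 2, which is what makes such models readable as "this gene predicts this channel density."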
Pub Date: 2025-08-01; eCollection Date: 2025-08-08; DOI: 10.1016/j.patter.2025.101341
Noam Kolt, Michal Shur-Ofry, Reuven Cohen
The study of complex adaptive systems, pioneered in physics, biology, and the social sciences, offers important lessons for artificial intelligence (AI) governance. Contemporary AI systems and the environments in which they operate exhibit many of the properties characteristic of complex systems, including nonlinear growth patterns, emergent phenomena, and cascading effects that can lead to catastrophic failures. Complex systems science can help illuminate the features of AI that pose central challenges for policymakers, such as feedback loops induced by training AI models on synthetic data and the interconnectedness between AI systems and critical infrastructure. Drawing on insights from other domains shaped by complex systems, including public health and climate change, we examine how efforts to govern AI are marked by deep uncertainty. To contend with this challenge, we propose three desiderata for designing a set of complexity-compatible AI governance principles comprised of early and scalable intervention, adaptive institutional design, and risk thresholds calibrated to trigger timely and effective regulatory responses.
{"title":"Lessons from complex systems science for AI governance.","authors":"Noam Kolt, Michal Shur-Ofry, Reuven Cohen","doi":"10.1016/j.patter.2025.101341","DOIUrl":"10.1016/j.patter.2025.101341","url":null,"abstract":"<p><p>The study of complex adaptive systems, pioneered in physics, biology, and the social sciences, offers important lessons for artificial intelligence (AI) governance. Contemporary AI systems and the environments in which they operate exhibit many of the properties characteristic of complex systems, including nonlinear growth patterns, emergent phenomena, and cascading effects that can lead to catastrophic failures. Complex systems science can help illuminate the features of AI that pose central challenges for policymakers, such as feedback loops induced by training AI models on synthetic data and the interconnectedness between AI systems and critical infrastructure. Drawing on insights from other domains shaped by complex systems, including public health and climate change, we examine how efforts to govern AI are marked by deep uncertainty. To contend with this challenge, we propose three desiderata for designing a set of complexity-compatible AI governance principles comprised of early and scalable intervention, adaptive institutional design, and risk thresholds calibrated to trigger timely and effective regulatory responses.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 8","pages":"101341"},"PeriodicalIF":7.4,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365527/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144972451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-30; eCollection Date: 2025-08-08; DOI: 10.1016/j.patter.2025.101326
Ping Qiu, Qianqian Chen, Hua Qin, Shuangsang Fang, Yilin Zhang, Yanlin Zhang, Tianyi Xia, Lei Cao, Yong Zhang, Xiaodong Fang, Yuxiang Li, Luni Hu
The application and evaluation of single-cell foundation models (scFMs) present significant challenges due to heterogeneous architectures and coding standards. To address this, we introduce BioLLM (biological large language model), a unified framework for integrating and applying scFMs to single-cell RNA sequencing analysis. BioLLM provides a unified interface that integrates diverse scFMs, eliminating architectural and coding inconsistencies to enable streamlined model access. With standardized APIs and comprehensive documentation, BioLLM supports streamlined model switching and consistent benchmarking. Our comprehensive evaluation of scFMs revealed distinct strengths and limitations, highlighting scGPT's robust performance across all tasks, including zero-shot and fine-tuning. Geneformer and scFoundation demonstrated strong capabilities in gene-level tasks, benefiting from effective pretraining strategies. In contrast, scBERT lagged behind, likely due to its smaller model size and limited training data. Ultimately, BioLLM aims to empower the scientific community to leverage the full potential of foundational models, advancing our understanding of complex biological systems through enhanced single-cell analysis.
{"title":"BioLLM: A standardized framework for integrating and benchmarking single-cell foundation models.","authors":"Ping Qiu, Qianqian Chen, Hua Qin, Shuangsang Fang, Yilin Zhang, Yanlin Zhang, Tianyi Xia, Lei Cao, Yong Zhang, Xiaodong Fang, Yuxiang Li, Luni Hu","doi":"10.1016/j.patter.2025.101326","DOIUrl":"10.1016/j.patter.2025.101326","url":null,"abstract":"<p><p>The application and evaluation of single-cell foundation models (scFMs) present significant challenges due to heterogeneous architectures and coding standards. To address this, we introduce BioLLM (biological large language model), a unified framework for integrating and applying scFMs to single-cell RNA sequencing analysis. BioLLM provides a unified interface that integrates diverse scFMs, eliminating architectural and coding inconsistencies to enable streamlined model access. With standardized APIs and comprehensive documentation, BioLLM supports streamlined model switching and consistent benchmarking. Our comprehensive evaluation of scFMs revealed distinct strengths and limitations, highlighting scGPT's robust performance across all tasks, including zero-shot and fine-tuning. Geneformer and scFoundation demonstrated strong capabilities in gene-level tasks, benefiting from effective pretraining strategies. In contrast, scBERT lagged behind, likely due to its smaller model size and limited training data. Ultimately, BioLLM aims to empower the scientific community to leverage the full potential of foundational models, advancing our understanding of complex biological systems through enhanced single-cell analysis.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 8","pages":"101326"},"PeriodicalIF":7.4,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12365531/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144972404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
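A unified interface over heterogeneous models, as BioLLM provides, is essentially an adapter-plus-registry pattern: each model is wrapped to expose the same methods, and callers look models up by name. The sketch below shows the general idea with invented class and method names; BioLLM's actual API will differ:

```python
from abc import ABC, abstractmethod

class SCFMAdapter(ABC):
    """Hypothetical common interface; real scFMs expose very different APIs."""
    @abstractmethod
    def embed(self, expression_matrix):
        """Return one embedding per cell (row) of the expression matrix."""

class ModelRegistry:
    """Name-based lookup so benchmarking code can swap models freely."""
    def __init__(self):
        self._models = {}

    def register(self, name, adapter):
        self._models[name] = adapter

    def get(self, name):
        return self._models[name]

class MeanBaselineAdapter(SCFMAdapter):
    """Stand-in 'model': embeds each cell as its mean expression value."""
    def embed(self, expression_matrix):
        return [sum(row) / len(row) for row in expression_matrix]

registry = ModelRegistry()
registry.register("baseline", MeanBaselineAdapter())
emb = registry.get("baseline").embed([[1.0, 3.0], [2.0, 2.0]])
```

Because every adapter satisfies the same interface, a benchmark loop can iterate over registered names without model-specific branches, which is the design choice that makes consistent cross-model comparison possible.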
Pub Date: 2025-07-30; eCollection Date: 2025-09-12; DOI: 10.1016/j.patter.2025.101321
Ruowang Li, Luke Benz, Rui Duan, Joshua C Denny, Hakon Hakonarson, Jonathan D Mosley, Jordan W Smoller, Wei-Qi Wei, Thomas Lumley, Marylyn D Ritchie, Jason H Moore, Yong Chen
In cross-cohort studies, integrating diverse datasets is essential and challenging due to cohort-specific variations, distributed data storage, and privacy concerns. Traditional methods often require data pooling or harmonization, which can reduce efficiency and limit the scope of cross-cohort learning. We introduce mixWAS, a one-shot, lossless algorithm that efficiently integrates distributed electronic health record (EHR) datasets via summary statistics. Unlike existing approaches, mixWAS preserves cohort-specific covariate associations and supports simultaneous mixed-outcome analyses. Simulations demonstrate that mixWAS outperforms conventional methods in accuracy and efficiency across various scenarios. Applied to EHR data from seven cohorts in the US, mixWAS identified 4,530 significant cross-cohort genetic associations among traits such as blood lipids, BMI, and circulatory diseases. Validation with an independent UK EHR dataset confirmed 97.7% of these associations, underscoring the algorithm's robustness. By enabling lossless cross-cohort integration, mixWAS improves the precision of multi-outcome analyses and expands the potential for actionable insights in healthcare research.
{"title":"A one-shot, lossless algorithm for cross-cohort learning in mixed-outcomes analysis.","authors":"Ruowang Li, Luke Benz, Rui Duan, Joshua C Denny, Hakon Hakonarson, Jonathan D Mosley, Jordan W Smoller, Wei-Qi Wei, Thomas Lumley, Marylyn D Ritchie, Jason H Moore, Yong Chen","doi":"10.1016/j.patter.2025.101321","DOIUrl":"10.1016/j.patter.2025.101321","url":null,"abstract":"<p><p>In cross-cohort studies, integrating diverse datasets is essential and challenging due to cohort-specific variations, distributed data storage, and privacy concerns. Traditional methods often require data pooling or harmonization, which can reduce efficiency and limit the scope of cross-cohort learning. We introduce mixWAS, a one-shot, lossless algorithm that efficiently integrates distributed electronic health record (EHR) datasets via summary statistics. Unlike existing approaches, mixWAS preserves cohort-specific covariate associations and supports simultaneous mixed-outcome analyses. Simulations demonstrate that mixWAS outperforms conventional methods in accuracy and efficiency across various scenarios. Applied to EHR data from seven cohorts in the US, mixWAS identified 4,530 significant cross-cohort genetic associations among traits such as blood lipids, BMI, and circulatory diseases. Validation with an independent UK EHR dataset confirmed 97.7% of these associations, underscoring the algorithm's robustness. By enabling lossless cross-cohort integration, mixWAS improves the precision of multi-outcome analyses and expands the potential for actionable insights in healthcare research.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 9","pages":"101321"},"PeriodicalIF":7.4,"publicationDate":"2025-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12485519/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145214044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
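The general idea behind one-shot integration via summary statistics can be shown with inverse-variance fixed-effect meta-analysis: each cohort ships only an effect estimate and standard error, yet for a shared effect the pooled estimate matches what pooling raw data would give. This is a textbook stand-in to illustrate the principle, not the mixWAS algorithm itself, and the numbers below are invented:

```python
def fixed_effect_meta(estimates, std_errors):
    """Inverse-variance weighted pooling of per-cohort effect estimates.
    Each cohort contributes only (beta_k, se_k) -- one round of summary
    statistics, no individual-level data."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * b for w, b in zip(weights, estimates)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5
    return pooled, pooled_se

# Three hypothetical cohorts report the same SNP effect with different precision.
beta, se = fixed_effect_meta([0.10, 0.14, 0.12], [0.02, 0.04, 0.02])
```

The pooled standard error is smaller than any single cohort's, showing how distributed cohorts gain power without ever sharing patient-level records.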
Pub Date: 2025-07-29; eCollection Date: 2025-10-10; DOI: 10.1016/j.patter.2025.101320
Lisa Pilgram, Fida Kamal Dankar, Jörg Drechsler, Mark Elliot, Josep Domingo-Ferrer, Paul Francis, Murat Kantarcioglu, Linglong Kong, Bradley Malin, Krishnamurty Muralidhar, Puja Myles, Fabian Prasser, Jean Louis Raisaro, Chao Yan, Khaled El Emam
Synthetic data generation is a promising approach for sharing data for secondary purposes in sensitive sectors. However, to meet ethical standards and legislative requirements, it is necessary to demonstrate that the privacy of the individuals upon which the synthetic records are based is adequately protected. Through an expert consensus process, we developed a framework for privacy evaluation in synthetic data. The most commonly used metrics measure similarity between real and synthetic data and are assumed to capture identity disclosure. Our findings indicate that they lack precise interpretation and should be avoided. There was consensus on the importance of membership and attribute disclosure, both of which involve inferring personal information. The framework provides recommendations to effectively measure these types of disclosures, which also apply to differentially private synthetic data if the privacy budget is not close to zero. We further present future research opportunities to support widespread adoption of synthetic data.
{"title":"A consensus privacy metrics framework for synthetic data.","authors":"Lisa Pilgram, Fida Kamal Dankar, Jörg Drechsler, Mark Elliot, Josep Domingo-Ferrer, Paul Francis, Murat Kantarcioglu, Linglong Kong, Bradley Malin, Krishnamurty Muralidhar, Puja Myles, Fabian Prasser, Jean Louis Raisaro, Chao Yan, Khaled El Emam","doi":"10.1016/j.patter.2025.101320","DOIUrl":"10.1016/j.patter.2025.101320","url":null,"abstract":"<p><p>Synthetic data generation is a promising approach for sharing data for secondary purposes in sensitive sectors. However, to meet ethical standards and legislative requirements, it is necessary to demonstrate that the privacy of the individuals upon which the synthetic records are based is adequately protected. Through an expert consensus process, we developed a framework for privacy evaluation in synthetic data. The most commonly used metrics measure similarity between real and synthetic data and are assumed to capture identity disclosure. Our findings indicate that they lack precise interpretation and should be avoided. There was consensus on the importance of membership and attribute disclosure, both of which involve inferring personal information. The framework provides recommendations to effectively measure these types of disclosures, which also apply to differentially private synthetic data if the privacy budget is not close to zero. We further present future research opportunities to support widespread adoption of synthetic data.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 10","pages":"101320"},"PeriodicalIF":7.4,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12546437/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145379152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
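A basic membership-disclosure check of the kind such a framework recommends can be sketched as a nearest-neighbor membership inference: if training records sit unusually close to synthetic records compared with holdout records, the generator has leaked membership information. The sketch below is a simplified illustration; the distance threshold, data, and function names are ours, and practical attacks are more sophisticated:

```python
def nn_distance(record, dataset):
    """Euclidean distance from a record to its nearest neighbor in dataset."""
    return min(sum((a - b) ** 2 for a, b in zip(record, other)) ** 0.5
               for other in dataset)

def membership_advantage(train, holdout, synthetic, threshold):
    """Fraction of training records 'flagged' (closer than threshold to some
    synthetic record) minus the same fraction for holdout records.
    Near 0 = little membership leakage; near 1 = severe leakage."""
    tpr = sum(nn_distance(r, synthetic) < threshold for r in train) / len(train)
    fpr = sum(nn_distance(r, synthetic) < threshold for r in holdout) / len(holdout)
    return tpr - fpr

# Worst case: a leaky generator that memorized its training rows verbatim.
train = [[0.0, 0.0], [1.0, 1.0]]
holdout = [[5.0, 5.0], [6.0, 6.0]]
synthetic = [[0.0, 0.0], [1.0, 1.0]]
adv = membership_advantage(train, holdout, synthetic, threshold=0.5)
```

Comparing against a holdout set is what distinguishes genuine membership disclosure from records that are merely typical of the population.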
Pub Date: 2025-07-23; eCollection Date: 2025-11-14; DOI: 10.1016/j.patter.2025.101325
Nicholas Kluge Corrêa, Aniket Sen, Sophia Falk, Shiza Fatimah
Natural language processing has seen substantial progress in recent years. However, current deep-learning-based language models demand extensive data and computational resources. This data-intensive paradigm has led to a divide between high-resource languages, where development is thriving, and low-resource languages, which lag behind. To address this disparity, this study introduces a new set of resources to advance neural text generation for Portuguese. Here, we document the development of GigaVerbo, a Portuguese text corpus amounting to 200 billion tokens. Using this corpus, we trained Tucano, a family of decoder-only transformer models. Our models consistently outperform comparable Portuguese and multilingual models on several benchmarks. All models, datasets, and tools developed in this work are openly available to the community to support reproducible research.
{"title":"Tucano: Advancing neural text generation for Portuguese.","authors":"Nicholas Kluge Corrêa, Aniket Sen, Sophia Falk, Shiza Fatimah","doi":"10.1016/j.patter.2025.101325","DOIUrl":"10.1016/j.patter.2025.101325","url":null,"abstract":"<p><p>Natural language processing has seen substantial progress in recent years. However, current deep-learning-based language models demand extensive data and computational resources. This data-intensive paradigm has led to a divide between high-resource languages, where development is thriving, and low-resource languages, which lag behind. To address this disparity, this study introduces a new set of resources to advance neural text generation for Portuguese. Here, we document the development of GigaVerbo, a Portuguese text corpus amounting to 200 billion tokens. Using this corpus, we trained Tucano, a family of decoder-only transformer models. Our models consistently outperform comparable Portuguese and multilingual models on several benchmarks. All models, datasets, and tools developed in this work are openly available to the community to support reproducible research.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 11","pages":"101325"},"PeriodicalIF":7.4,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12664968/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145655771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-21; eCollection Date: 2025-11-14; DOI: 10.1016/j.patter.2025.101313
Juli Bakagianni, Kanella Pouli, Maria Gavriilidou, John Pavlopoulos
Comprehensive monolingual natural language processing (NLP) surveys are essential for assessing language-specific challenges, resource availability, and research gaps. However, existing surveys often lack standardized methodologies, leading to selection bias and fragmented coverage of NLP tasks and resources. This study introduces a generalizable framework for systematic monolingual NLP surveys. Our approach integrates a structured search protocol to minimize bias, an NLP task taxonomy for classification, and language resource taxonomies to identify potential benchmarks and highlight opportunities for improving resource availability. We apply this framework to Greek NLP (2012-2023), providing an in-depth analysis of its current state, task-specific progress, and resource gaps. The survey results are publicly available and are regularly updated to provide an evergreen resource. This systematic survey of Greek NLP serves as a case study, demonstrating the effectiveness of our framework and its potential for broader application to other languages with limited NLP resources.
{"title":"A systematic survey of natural language processing for the Greek language.","authors":"Juli Bakagianni, Kanella Pouli, Maria Gavriilidou, John Pavlopoulos","doi":"10.1016/j.patter.2025.101313","DOIUrl":"10.1016/j.patter.2025.101313","url":null,"abstract":"<p><p>Comprehensive monolingual natural language processing (NLP) surveys are essential for assessing language-specific challenges, resource availability, and research gaps. However, existing surveys often lack standardized methodologies, leading to selection bias and fragmented coverage of NLP tasks and resources. This study introduces a generalizable framework for systematic monolingual NLP surveys. Our approach integrates a structured search protocol to minimize bias, an NLP task taxonomy for classification, and language resource taxonomies to identify potential benchmarks and highlight opportunities for improving resource availability. We apply this framework to Greek NLP (2012-2023), providing an in-depth analysis of its current state, task-specific progress, and resource gaps. The survey results are publicly available and are regularly updated to provide an evergreen resource. This systematic survey of Greek NLP serves as a case study, demonstrating the effectiveness of our framework and its potential for broader application to other languages with limited NLP resources.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"6 11","pages":"101313"},"PeriodicalIF":7.4,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12715428/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145805594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}