Wolfgang Ganglberger, Samaneh Nasiri, Haoqi Sun, Soriul Kim, Chol Shin, M Brandon Westover, Robert J Thomas
Title: Refining sleep staging accuracy: transfer learning coupled with scorability models.
Journal: Sleep (impact factor 5.6; JCR Q1, Medicine)
Publication date: 2024-11-08
Publication type: Journal Article
DOI: 10.1093/sleep/zsae202 (https://doi.org/10.1093/sleep/zsae202)
Citation count: 0
Abstract
Study objectives: This study aimed to (1) improve sleep staging accuracy through transfer learning (TL), to achieve or exceed human inter-expert agreement and (2) introduce a scorability model to assess the quality and trustworthiness of automated sleep staging.
Methods: A deep neural network (base model) was trained on a large multi-site polysomnography (PSG) dataset from the United States. TL was used to calibrate the model to a reduced montage and limited samples from the Korean Genome and Epidemiology Study (KoGES) dataset. Model performance was compared to inter-expert reliability among three human experts. A scorability assessment was developed to predict the agreement between the model and human experts.
Results: Initial sleep staging by the base model showed lower agreement with experts (κ = 0.55) compared to the inter-expert agreement (κ = 0.62). Calibration with 324 randomly sampled training cases matched expert agreement levels. Further targeted sampling improved performance, with models exceeding inter-expert agreement (κ = 0.70). The scorability assessment, combining biosignal quality and model confidence features, predicted model-expert agreement moderately well (R² = 0.42). Recordings with higher scorability scores demonstrated greater model-expert agreement than inter-expert agreement. Even with lower scorability scores, model performance was comparable to inter-expert agreement.
Conclusions: Fine-tuning a pretrained neural network through targeted TL significantly enhances sleep staging performance for an atypical montage, achieving and surpassing human expert agreement levels. The introduction of a scorability assessment provides a robust measure of reliability, ensuring quality control and enhancing the practical application of the system before deployment. This approach marks an important advancement in automated sleep analysis, demonstrating the potential for AI to exceed human performance in clinical settings.
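The agreement values reported in the Results (κ = 0.55, 0.62, 0.70) are chance-corrected agreement scores. The abstract does not state which kappa variant was used or how the three experts were combined, so the sketch below assumes plain Cohen's kappa between two scorers. The function name and the epoch labels are hypothetical and serve only to illustrate the arithmetic; they are not the study's data or code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences.

    Assumption: two scorers, nominal labels, unweighted kappa.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal in length")

    n = len(labels_a)
    # Observed agreement: fraction of epochs where both scorers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each scorer's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1.0:
        # Both scorers used the same single label for every epoch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical 30-second epochs scored by a model and by one expert.
model_stages = ["W", "N1", "N2", "N2", "N3", "R", "R", "W"]
expert_stages = ["W", "N2", "N2", "N2", "N3", "R", "N1", "W"]

kappa = cohens_kappa(model_stages, expert_stages)
print(round(kappa, 2))  # prints 0.68
```

In this toy example the scorers agree on 6 of 8 epochs (0.75 raw agreement), but 0.21875 of that is expected by chance, which yields κ = 0.68. This is why κ is lower than raw accuracy for the same pair of scorings.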
About the journal:
SLEEP® publishes findings from studies conducted at any level of analysis, including:
Genes
Molecules
Cells
Physiology
Neural systems and circuits
Behavior and cognition
Self-report
SLEEP® publishes articles that use a wide variety of scientific approaches and address a broad range of topics. These may include, but are not limited to:
Basic and neuroscience studies of sleep and circadian mechanisms
In vitro and animal models of sleep, circadian rhythms, and human disorders
Pre-clinical human investigations, including the measurement and manipulation of sleep and circadian rhythms
Studies in clinical or population samples. These may address factors influencing sleep and circadian rhythms (e.g., development and aging, and social and environmental influences) and relationships between sleep, circadian rhythms, health, and disease
Clinical trials, epidemiology studies, implementation, and dissemination research.