Inside the “brain” of an artificial neural network: an interpretable deep learning approach to paroxysmal atrial fibrillation diagnosis from electrocardiogram signals during sinus rhythm
P. Pantelidis, E. Oikonomou, S. Lampsas, N. Souvaliotis, M. Spartalis, M. Vavuranakis, M. Bampa, P. Papapetrou, G. Siasos, M. Vavuranakis
European Heart Journal. Digital Health. Published 2022-10-01. DOI: 10.1093/ehjdh/ztac076.2781 (https://doi.org/10.1093/ehjdh/ztac076.2781)
Abstract
Background: With the ongoing, rapid advances in deep learning (DL), DL-based solutions can now detect medical conditions that are invisible even to the human eye. In this direction, efforts have been made to develop DL algorithms that diagnose paroxysmal atrial fibrillation (PAF) from electrocardiogram (ECG) signals recorded in sinus rhythm (SR). However, many of the available approaches function as “black boxes”, leaving physicians unable to understand and trust their predictions.

Purpose: To train a DL model to detect PAF patients while in SR and to apply an algorithm that interprets and visualises its decisions.

Methods: We obtained ECG samples from PAF and non-PAF patients during SR from the PAF Prediction Challenge Database. After discarding unannotated samples and augmenting the sample size (by dividing each signal into 30-second segments), we split the whole dataset into a training (68%), a validation (16%) and a test (16%) set. No pair of samples belonging to different sets originated from the same patient. We trained the InceptionTime neural network on the training and validation sets and tested it on the “unseen” test set after “hiding” the correct answers. Its performance was evaluated with the following metrics: accuracy, F1-score, precision and recall (sensitivity). After repeating this process 20 times, we obtained a distribution for each score. Finally, we adjusted the Grad-CAM interpretation algorithm to our data and used it to visualise the areas perceived as important by the model.

Results: After pre-processing, 4,080 30-second, two-lead ECG signals were allocated to the training set, 960 to the validation set and 960 to the test set. Each subset contained an equal number of PAF and non-PAF samples. After repeated training and testing, we obtained a median accuracy of 0.84 (interquartile range, IQR: 0.66–0.88), an F1-score of 0.82 (IQR: 0.68–0.88), and a median precision and recall of 0.93 (IQR: 0.67–0.99) and 0.77 (IQR: 0.68–0.93), respectively.
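The evaluation protocol described above (a patient-level split, four classification metrics, and a median with IQR over repeated runs) can be sketched roughly as follows. This is an illustrative sketch and not the authors' code: the function names, the `patient_ids` input and the quartile method are assumptions, and the split shown here does not enforce the equal PAF/non-PAF balance reported in the abstract.

```python
import random
import statistics


def split_by_patient(patient_ids, fractions=(0.68, 0.16, 0.16), seed=0):
    """Assign whole patients, not single segments, to train/val/test.

    patient_ids[i] is the (hypothetical) patient label of segment i.
    Returns a dict mapping each set name to a list of segment indices.
    """
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)
    n_train = round(fractions[0] * len(patients))
    n_val = round(fractions[1] * len(patients))
    owners = {
        "train": set(patients[:n_train]),
        "val": set(patients[n_train:n_train + n_val]),
        "test": set(patients[n_train + n_val:]),
    }
    # Every segment follows its patient, so no patient spans two sets.
    return {
        name: [i for i, pid in enumerate(patient_ids) if pid in members]
        for name, members in owners.items()
    }


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = PAF)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def median_and_iqr(scores):
    """Median and quartiles (Q1, Q3) of one metric over repeated runs."""
    # The quartile convention is an assumption; the abstract does not state one.
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return statistics.median(scores), (q1, q3)
```

In a setup like the one described, `binary_metrics` would be called once per training run on the test set, and `median_and_iqr` would then summarise each metric across the 20 runs.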
The Grad-CAM technique highlighted the ECG areas of interest that led to each decision. We selected and present both PAF-positive and PAF-negative samples, classified either correctly or incorrectly. Interestingly, correct model decisions tend to focus on the P-wave, while incorrect ones fixate on other regions.

Conclusions: Although this is a pilot study with considerable limitations (small sample size, disregard of possible confounding due to comorbidities or other factors), it shows how DL can be employed to distinguish between PAF and non-PAF patients from SR ECG samples, and confirms the potential of DL-enabled approaches to offer novel diagnostic capabilities. Most importantly, our effort provides a comprehensible, visual interpretation of the model's decisions. Demystifying DL behaviour can not only improve such efforts by explaining false decisions, but also cultivate trust among clinicians and, possibly, point out directions for future research, since we can now see through the magnifying lens of a neural network.

Funding Acknowledgement: Type of funding sources: None.

Figures: Figure 1, Figure 2.
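The abstract does not say how Grad-CAM was adjusted to ECG data. As a rough illustration only, the standard Grad-CAM formulation reduced to a single time axis might look like the sketch below. It assumes that the last convolutional layer's activations and the gradients of the class score with respect to them have already been extracted from the network; `grad_cam_1d` and its arguments are hypothetical names, and the authors' adaptation may differ.

```python
def grad_cam_1d(activations, gradients, signal_length):
    """Standard Grad-CAM restricted to one (time) axis.

    activations, gradients: one list per channel, each holding one value
    per time step of the last convolutional layer (assumed inputs).
    Returns a heatmap in [0, 1] with signal_length values.
    """
    n_steps = len(activations[0])
    # Channel weights: gradient of the class score averaged over time.
    weights = [sum(g) / len(g) for g in gradients]
    # Weighted sum of the feature maps, then ReLU to keep positive evidence.
    cam = [
        max(0.0, sum(w * a[t] for w, a in zip(weights, activations)))
        for t in range(n_steps)
    ]
    peak = max(cam)
    if peak > 0:
        cam = [c / peak for c in cam]
    if n_steps == 1 or signal_length == 1:
        return [cam[0]] * signal_length
    # Linear interpolation back to the resolution of the ECG signal.
    heatmap = []
    for i in range(signal_length):
        pos = i * (n_steps - 1) / (signal_length - 1)
        lo = int(pos)
        hi = min(lo + 1, n_steps - 1)
        frac = pos - lo
        heatmap.append(cam[lo] * (1 - frac) + cam[hi] * frac)
    return heatmap
```

Overlaying such a heatmap on the ECG trace is one way to see whether high values coincide with a region such as the P-wave, which is the kind of comparison the Results describe.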