Unsupervised algorithms to identify potential under-coding of secondary diagnoses in hospitalisations databases in Portugal.

Health Information Management: Journal of the Health Information Management Association of Australia (IF 1.8). Pub Date: 2024-09-01. Epub Date: 2023-02-17. DOI: 10.1177/18333583221144663

Diana Portela, Rita Amaral, Pedro P Rodrigues, Alberto Freitas, Elísio Costa, João A Fonseca, Bernardo Sousa-Pinto

Pages 174-182. Full-text PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11408983/pdf/

Abstract

Background: Quantifying and dealing with lack of consistency in administrative databases (namely, under-coding) requires tracking patients longitudinally without compromising anonymity, which is often a challenging task.

Objective: This study aimed to (i) assess and compare different hierarchical clustering methods on the identification of individual patients in an administrative database that does not easily allow tracking of episodes from the same patient; (ii) quantify the frequency of potential under-coding; and (iii) identify factors associated with such phenomena.

Method: We analysed the Portuguese National Hospital Morbidity Dataset, an administrative database registering all hospitalisations occurring in Mainland Portugal between 2011 and 2015. We applied different hierarchical clustering approaches (either isolated or combined with partitional clustering methods) to identify potential individual patients based on demographic variables and comorbidities. Diagnosis codes were grouped into the Charlson and Elixhauser comorbidity defined groups. The algorithm displaying the best performance was used to quantify potential under-coding. A generalised mixed model (GML) of binomial regression was applied to assess factors associated with such potential under-coding.
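The combination of hierarchical and partitional clustering described in the Method can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes centroid linkage, squared Euclidean distance, and a known number of clusters k, none of which the abstract specifies. The function names (hca_clusters, kmeans, hca_then_kmeans) and the toy feature columns are hypothetical, and the naive pairwise search is only practical for very small inputs.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def hca_clusters(points, k):
    """Naive agglomerative clustering (assumed centroid linkage) down to k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        def gap(pair):
            a, b = pair
            return sq_dist(centroid([points[i] for i in clusters[a]]),
                           centroid([points[i] for i in clusters[b]]))
        # Merge the two clusters whose centroids are closest.
        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[a].extend(clusters[b])  # a < b, so index a stays valid
        del clusters[b]
    return clusters


def kmeans(points, centers, max_iter=100):
    """Lloyd's k-means starting from the given centers; returns one label per point."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels


def hca_then_kmeans(points, k):
    """Seed k-means with the centroids of the hierarchical clusters."""
    seeds = [centroid([points[i] for i in members])
             for members in hca_clusters(points, k)]
    return kmeans(points, seeds)


# Hypothetical episodes: (age scaled to 0-1, sex, diabetes flag, asthma flag).
episodes = [
    (0.70, 1, 1, 0),
    (0.71, 1, 1, 0),
    (0.70, 1, 0, 0),  # same demographics, diabetes not coded
    (0.30, 0, 0, 1),
    (0.31, 0, 0, 1),
    (0.30, 0, 0, 0),  # same demographics, asthma not coded
]
labels = hca_then_kmeans(episodes, k=2)
print(labels)
```

In this toy input the first three and the last three episodes fall into separate clusters, so an episode with a missing comorbidity flag is still grouped with the episodes it most resembles. How the number of clusters is chosen for a real database is not stated in the abstract and is left open here.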

Results: We observed that the hierarchical cluster analysis (HCA) + k-means clustering method with comorbidities grouped according to the Charlson defined groups was the algorithm displaying the best performance (with a Rand Index of 0.99997). We identified potential under-coding in all Charlson comorbidity groups, ranging from 3.5% (overall diabetes) to 27.7% (asthma). Overall, being male, having medical admission, dying during hospitalisation or being admitted at more specific and complex hospitals were associated with increased odds of potential under-coding.
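For readers unfamiliar with the Rand Index reported above: it is the share of record pairs on which two partitions agree, that is, pairs placed together in both or apart in both, divided by all possible pairs. The sketch below is a plain pairwise implementation for illustration only; the function name and toy labels are hypothetical, and the abstract does not say which reference partition the authors compared against.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of item pairs on which two labelings agree (same/same or different/different)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if len(labels_true) < 2:
        raise ValueError("need at least two items to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        agree += same_true == same_pred
        total += 1
    return agree / total


# Label names do not matter, only which items are grouped together.
identical_grouping = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_grouping = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(identical_grouping, crossed_grouping)
```

A value near 1, such as the 0.99997 reported, means almost every pair of episodes is treated the same way by both partitions. When most pairs belong to different patients, the index is dominated by "apart in both" pairs, so very high values are easier to reach than with chance-adjusted variants.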

Discussion: We assessed several approaches to identify individual patients in an administrative database and, subsequently, by applying the HCA + k-means algorithm, we tracked coding inconsistency and potentially improved data quality. We reported consistent potential under-coding in all defined groups of comorbidities and potential factors associated with such lack of completeness.

Conclusion: Our proposed methodological framework could both enhance data quality and act as a reference for other studies relying on databases with similar problems.
