Background & objective: Most existing methods for indirectly deriving reference intervals (RIs) from routine laboratory databases use univariate approaches with limited or no rigorous data cleaning. Recognizing the potential of multivariate data-mining strategies, we developed novel software, SOM-clean, that employs self-organizing map (SOM) clustering for iterative exclusion of records exhibiting atypical multi-test patterns.
Methods: We retrieved records for 22 major health-screening tests (HSTs) from a Saudi Arabian laboratory participating in an RI study. After excluding records from frequently tested individuals and those with <10 HST results, 37,285 records remained for analysis. Initial crude RIs were calculated parametrically using a two-parameter Box-Cox power transformation. All transformed values were standardized against these RIs to generate uniform-scale values, so that any result within RI limits fell between ±1.96. The self-organizing map (m × m cells, m = 5-8) was initialized with normal random values, and each record was assigned to the cell with the highest similarity. Each cell's pattern was then updated from the records assigned to it, and this learning process was repeated until the map reached equilibrium. Subsequently, cells exhibiting atypical features were excluded, and RIs were recalculated using records from the remaining cells. This process was repeated iteratively until all RIs stabilized.
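The standardization and map-learning steps described in the Methods can be illustrated with a minimal Python sketch. It is not the SOM-clean implementation: the function names, the Gaussian neighbourhood with width `sigma`, the squared-Euclidean similarity measure, the iteration cap, and the assumption that every record is complete (no missing test results) are all illustrative choices not stated in the abstract.

```python
import numpy as np

def box_cox(x, lam, shift):
    """Two-parameter Box-Cox transform: lam is the power, shift the offset."""
    x = np.asarray(x, dtype=float) + shift
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

def standardize(t, lower, upper):
    """Rescale transformed values so the RI limits map to -1.96 and +1.96."""
    centre = (lower + upper) / 2.0
    sd = (upper - lower) / (2.0 * 1.96)
    return (np.asarray(t, dtype=float) - centre) / sd

def assign(z, cells):
    """Index of the closest cell for each record (squared Euclidean distance)."""
    # The |z|^2 term is the same for every cell, so it is dropped from the argmin.
    d2 = -2.0 * z @ cells.T + (cells**2).sum(axis=1)
    return d2.argmin(axis=1)

def train_som(z, m=5, n_iter=50, sigma=1.0, seed=0):
    """Batch SOM on complete standardized records z (n_records x n_tests)."""
    rng = np.random.default_rng(seed)
    cells = rng.standard_normal((m * m, z.shape[1]))  # normal random start
    grid = np.array([(i, j) for i in range(m) for j in range(m)], dtype=float)
    grid_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-grid_d2 / (2.0 * sigma**2))  # assumed Gaussian neighbourhood
    labels = np.full(len(z), -1)
    for _ in range(n_iter):
        new_labels = assign(z, cells)
        if np.array_equal(new_labels, labels):  # assignments stable: equilibrium
            break
        labels = new_labels
        h = weights[:, labels]  # influence of each record on each cell
        cells = (h @ z) / h.sum(axis=1, keepdims=True)
    return cells, assign(z, cells)
```

In such a sketch, cells judged atypical would then be dropped, the RIs recomputed from the records in the remaining cells, and the whole cycle repeated until the RIs stop changing; the criteria for flagging a cell are left out here because the abstract does not specify them.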
Results: Histograms of retrieved results frequently exhibited peaks differing in shape and location from those in the direct study (n = 880). The goodness-of-fit (GOF) of SOM-clean RIs was assessed by skewness, kurtosis, and Kolmogorov-Smirnov test P-values after transformation, as well as by the bias ratio of reference limits compared with the direct study. GOF depended on map size and criteria for identifying atypical cells; the software therefore incorporated an all-inclusive search for optimal conditions referencing the direct study RIs. By using the optimal settings, SOM-clean achieved excellent GOF of RIs simultaneously across nearly all HSTs, indicating conformity of the estimated RIs to the healthy status. In comparison, RIs derived using a representative indirect method (refineR) were generally broader or biased, particularly for tests with highly skewed distributions.
Conclusion: SOM-clean represents a practical and robust parametric tool for estimating RIs indirectly from routine laboratory data employing a novel multivariate-based data cleaning scheme.