Application of Benford's Law for Quality Assessment of Preventive Screening Data

O. Starunova, S. Rudnev, A. Ivanova, V. G. Semenova, V. Starodubov
Mathematical Biology and Bioinformatics. Published 2022-11-05. DOI: 10.17537/2022.17.230 (https://doi.org/10.17537/2022.17.230)

Abstract

Benford's law, an empirical law describing the probability with which particular first significant digits appear in many real-life distributions, is used to identify anomalies in various kinds of data. Our aim was to test Benford's law as a tool for assessing the quality of mass preventive screening data, using bioelectrical impedance analysis (BIA) data from Moscow health centers as an example. As was shown earlier, such data are characterized by a high level of contamination by artificially generated and falsified data. The compiled 2010–2019 database of BIA measurements contained 1,361,019 measurement records for examined persons aged 5 to 96 years. Application of the expert quality assessment algorithm, which served as the reference for evaluating the effectiveness of the Benford analysis, revealed a high percentage of incorrect data (66.5%), dominated by falsified data. To characterize the degree of compliance of the data with Benford's law, the mean absolute deviations of the frequency distributions of the first and first two significant digits from the expected values, together with the chi-squared statistics, were assessed for each health center for the tenth powers of the standardized resistance, reactance, and resistance index values. A significant correlation was observed between the deviation of the data from Benford's law and the percentage of incorrect data as determined by the expert quality assessment algorithm (ρmax = 0.66 and 0.62 for the mean absolute deviations and χ² statistics, respectively, based on the resistance value and the first significant digit). It is suggested that deviation of the BIA data from Benford's law serves as a sufficient, but not a necessary, condition for their contamination. For those health centers in which most of the incorrect data were multiple measurements of the same person presented under the guise of different persons, the data were in good agreement with Benford's law. Where the incorrect data were dominated by measurements of the calibration block, software emulations of BIA measurements, and outliers, Benford's law made it possible to rank health centers effectively by the level of data authenticity.
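
The two compliance measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `first_significant_digit` and `benford_deviation` are hypothetical, only the first-digit case is covered, and the standardization of the BIA values is assumed to have been done beforehand, since the abstract does not describe it. The expected frequencies follow the standard statement of Benford's law, P(d) = log10(1 + 1/d).

```python
import math
from collections import Counter

# Benford's expected first-digit probabilities: P(d) = log10(1 + 1/d), d = 1..9.
BENFORD_FIRST_DIGIT = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_significant_digit(x: float) -> int:
    """Return the first significant digit (1-9) of a non-zero finite number."""
    # Scientific notation always starts with the leading significant digit.
    return int(f"{abs(x):.17e}"[0])


def benford_deviation(values, power=10):
    """Return (MAD, chi-squared) of first-digit frequencies versus Benford's law.

    `values` stands in for the already standardized measurements of one health
    center (an assumption: the standardization itself is not reproduced here).
    """
    # Raise to the given power, skipping zeros, which have no significant digit.
    digits = [first_significant_digit(abs(v) ** power) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("no non-zero values to analyse")
    counts = Counter(digits)
    mad = 0.0
    chi2 = 0.0
    for d, expected in BENFORD_FIRST_DIGIT.items():
        observed = counts.get(d, 0) / n
        mad += abs(observed - expected) / 9  # mean over the nine digits
        chi2 += n * (observed - expected) ** 2 / expected
    return mad, chi2


if __name__ == "__main__":
    # Powers of two are a classic Benford-conforming sequence; a constant
    # reading crudely mimics repeated measurements of a calibration block.
    conforming = [2.0 ** k for k in range(1000)]
    constant = [500.0] * 100
    print(benford_deviation(conforming, power=1))
    print(benford_deviation(constant, power=1))
```

Computing both values per health center and correlating them with the share of records flagged by an independent quality check would mirror the comparison the abstract reports. Note that a data set of repeated but genuine measurements can still pass this check, which is consistent with the abstract's remark that deviation is sufficient but not necessary evidence of contamination.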