[Statistical methods for extremely unbalanced data in genome-wide association study (1)]
N Xie, W J Bi, Z W Zhang, F Shao, Y Y Wei, Y Zhao, R Y Zhang, F Chen
Chinese Journal of Epidemiology (中华流行病学杂志), 45(11): 1582-1589. Published 2024-11-10. DOI: 10.3760/cma.j.cn112338-20240506-00235
Citations: 0
Abstract
Extremely unbalanced data here refers to datasets in which the values of an independent or dependent variable are severely imbalanced in proportion, for example an extremely unbalanced case-control ratio, a very low disease incidence, heavily censored time-to-event data, or low-frequency and rare variants. In such scenarios, test statistics from classical methods such as the logistic regression model and the Cox proportional hazards regression model may deviate from their theoretical asymptotic distributions, inflating or deflating the type I error rate. As resources from large-scale population cohorts become more available and more widely explored in genome-wide association studies (GWAS), there is a growing demand for effective and accurate statistical approaches to extremely unbalanced data in both independent and non-independent samples. This study first introduces the classical statistical methods of statistical genetics, then uses simulation experiments to summarize how these methods fail on extremely unbalanced data, in order to draw researchers' attention to extremely unbalanced data in GWAS.
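The abstract does not describe the simulation design, so the following is only a minimal sketch of how such a type I error check might be set up, not the authors' procedure. It assumes a single variant in Hardy-Weinberg equilibrium, no covariates, independent samples, and the asymptotic score test from an intercept-only logistic model. The function names, sample sizes, minor allele frequencies, and significance level are all hypothetical choices made for illustration.

```python
import math
import random


def score_test_pvalue(genotypes, phenotypes):
    """Asymptotic score-test p-value for one variant under an intercept-only
    logistic model (no covariates). Returns None if the test is undefined,
    e.g. for a monomorphic variant."""
    n = len(genotypes)
    y_bar = sum(phenotypes) / n
    g_bar = sum(genotypes) / n
    # Score U = sum_i g_i (y_i - y_bar)
    score = sum(g * (y - y_bar) for g, y in zip(genotypes, phenotypes))
    # Null variance V = y_bar (1 - y_bar) sum_i (g_i - g_bar)^2
    variance = y_bar * (1.0 - y_bar) * sum((g - g_bar) ** 2 for g in genotypes)
    if variance <= 0.0:
        return None
    stat = score * score / variance
    # Upper tail of a chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2.0))


def empirical_type1_error(n_cases, n_controls, maf, alpha, n_reps, seed=0):
    """Fraction of null replicates with p < alpha. Genotypes are drawn
    independently of case status, so every rejection is a false positive."""
    rng = random.Random(seed)
    n = n_cases + n_controls
    phenotypes = [1] * n_cases + [0] * n_controls
    # Hardy-Weinberg probabilities for 0, 1, 2 copies of the minor allele
    weights = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    rejections = 0
    tested = 0
    for _ in range(n_reps):
        genotypes = rng.choices([0, 1, 2], weights=weights, k=n)
        p = score_test_pvalue(genotypes, phenotypes)
        if p is None:
            continue  # skip replicates where the test is undefined
        tested += 1
        if p < alpha:
            rejections += 1
    return rejections / tested if tested else float("nan")


if __name__ == "__main__":
    # Hypothetical scenarios: (label, cases, controls, minor allele frequency)
    scenarios = [
        ("balanced, common variant", 500, 500, 0.30),
        ("unbalanced, rare variant", 10, 990, 0.01),
    ]
    for label, n_cases, n_controls, maf in scenarios:
        rate = empirical_type1_error(n_cases, n_controls, maf,
                                     alpha=0.01, n_reps=1000, seed=1)
        print(f"{label}: empirical type I error = {rate:.4f} (nominal 0.01)")
```

Under these assumed settings the balanced estimate should sit close to the nominal level, while the unbalanced, rare-variant estimate need not, because the score statistic takes only a few distinct values and is poorly approximated by a chi-square distribution. A real study would use far more replicates and genome-wide significance thresholds.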
About the journal:
Chinese Journal of Epidemiology, established in 1981, is an advanced academic periodical in epidemiology and related disciplines in China, which, according to the principle of integrating theory with practice, mainly reports the major progress in epidemiological research. The columns of the journal include commentary, expert forum, original article, field investigation, disease surveillance, laboratory research, clinical epidemiology, basic theory or method and review, etc.
The journal is indexed in more than ten major biomedical databases and index systems worldwide, including Scopus, PubMed/MEDLINE, PubMed Central (PMC), Europe PubMed Central, Embase, Chemical Abstracts, Chinese Science and Technology Paper and Citation Database (CSTPCD), Chinese core journal essentials overview, Chinese Science Citation Database (CSCD) core database, Chinese Biological Medical Disc (CBMdisc), and Chinese Medical Citation Index (CMCI). It is one of the core academic journals and carefully selected core journals in preventive and basic medicine in China.