What can go wrong when observations are not independently and identically distributed: A cautionary note on calculating correlations on combined data sets from different experiments or conditions.

Frontiers in Systems Biology (IF 2.3). Pub Date: 2023-01-30. eCollection Date: 2023-01-01. DOI: 10.3389/fsysb.2023.1042156

Edoardo Saccenti


Abstract

In the scientific literature, data analysis results are often presented in which samples from different experiments or conditions, technical replicates, or time series have been merged to increase the sample size before the correlation coefficient is calculated. This practice violates two basic assumptions underlying the use of the correlation coefficient: sampling from a single population and independence of the observations (independence of errors). Because correlations are used to measure and infer associations between biological entities, this has tremendous implications for the reliability of scientific results: violating these assumptions leads to wrong and biased results. In this technical note, I review some basic properties of Pearson's correlation coefficient and illustrate some exemplary problems with simulated and experimental data, taking a didactic approach supported by graphical examples.
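The pooling problem described above can be illustrated with a minimal simulation sketch. It is not taken from the paper: the two-condition setup, the sample size, the mean shift, and the function names `pearson_r` and `simulate_two_conditions` are all illustrative assumptions. Within each condition the two variables are generated independently, so their true correlation is zero, yet merging the conditions produces a large correlation driven purely by the difference in group means.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def simulate_two_conditions(n_per_group=200, shift=5.0, seed=0):
    """Hypothetical example: two conditions with independent x and y.

    Condition B differs from condition A only by a shift in the mean of
    both variables; within each condition x and y are uncorrelated.
    """
    rng = np.random.default_rng(seed)
    x_a = rng.normal(0.0, 1.0, n_per_group)
    y_a = rng.normal(0.0, 1.0, n_per_group)
    x_b = rng.normal(shift, 1.0, n_per_group)
    y_b = rng.normal(shift, 1.0, n_per_group)
    return (x_a, y_a), (x_b, y_b)


(x_a, y_a), (x_b, y_b) = simulate_two_conditions()

# Correlation computed separately within each condition (close to zero).
r_a = pearson_r(x_a, y_a)
r_b = pearson_r(x_b, y_b)

# Correlation computed after merging both conditions into one data set.
r_pooled = pearson_r(np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b]))

print(f"condition A: r = {r_a:.2f}")
print(f"condition B: r = {r_b:.2f}")
print(f"pooled:      r = {r_pooled:.2f}")
```

In this sketch the pooled coefficient reflects the separation between the two populations rather than any association between the variables, which is one way the single-population assumption can fail; computing the coefficient per condition keeps the two sources of variation apart.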


