
Five things biologists should know about statistics

2017-01-13 Y叔 biobabble


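Anyone who read the "What Is a T-Test" post (《什么是T检验》) all the way to the end is a true fan. As I said, I was not going to make a fuss at first. But when I commented that he had copied my post and still labeled it as original, he acted as though he were in the right, and that is just too much. So I am going to repost the few pieces that were copied from my blog and declare them my own original work. Then someone in a QQ group said, "Chinese people making trouble for Chinese people!" That is quite something too, so I am posting it along with the rest.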


A recent blog post discusses why statistics matters for biology.

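It opens with R. A. Fisher and argues that biology is, at root, statistics. Fisher worked in agricultural research, and the statistical methods he developed all grew out of biological problems.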

The five points Ewan discusses are:

1. Non parametric statistics. These are statistical tests which make a
bare minimum of assumptions of underlying distributions; in biology we
are rarely confident that we know the underlying distribution, and hand
waving about central limit theorem can only get you so far. Wherever
possible you should use a non parametric test. This is Mann-Whitney (or
Wilcoxon if you prefer) for testing “medians” (Medians is in quotes
because this is not quite true. They test something which is closely
related to the median) of two distributions, Spearman’s Rho (rather than
Pearson’s r2) for correlation, and the Kruskal test rather than ANOVAs
(though if I get this right, you can’t in Kruskal do the more
sophisticated nested models you can do with ANOVA). Finally, don’t
forget the rather wonderful Kolmogorov-Smirnov (I always think it sounds
like really good vodka) test of whether two sets of observations come
from the same distribution. All of these methods have a basic theme of
doing things on the rank of items in a distribution, not the actual
level. So - if in doubt, do things on the rank of metric, rather than
the metric itself.

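Statistics courses mostly teach the t-test and ANOVA. These methods rest on assumptions that must hold, such as normality. Most people simply take the assumptions for granted and apply the tests mechanically, which is risky: if an assumption is violated, or the data contain outliers, the conclusion may be wrong.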

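This year I worked with immunohistochemistry (IHC) data, where the results are quantified as scores assigned by a clinician. There were two groups, tumor tissue and adjacent non-tumor tissue. IHC scores cannot be analyzed with parametric statistics, so I tested this result with the Wilcoxon signed rank test. As for the other nonparametric methods mentioned in the post, I do not know any of them. (Embarrassing.)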

On correlation, here is a paper that compares Pearson and Spearman: (Kowalski, 1975)

Another paper argues that Kendall’s tau is a better choice than Spearman’s Rho: Newson R. . Stata Journal 2002; 2(1):45-64.
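To make the rank idea concrete, here is a minimal sketch in plain Python (the post itself works in R, where `cor(x, y, method = "spearman")` does this directly). The helper names `pearson`, `average_ranks`, and `spearman` and the toy data are my own assumptions, not anything from the cited papers. Spearman's Rho is computed here simply as Pearson's correlation applied to the ranks.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's Rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (hypothetical): a perfectly monotonic but non-linear relationship.
x = list(range(1, 11))
y = [v ** 3 for v in x]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {r_spearman:.3f}")
```

On this toy example the rank-based measure sees a perfect monotonic relationship, while Pearson's r is pulled below 1 because the relationship is not linear.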

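Although the author stresses nonparametric statistics, if the data do satisfy the assumptions of a parametric test, the parametric test is still the better choice because it is more powerful. That may mean checking the distribution of the data before running any tests. Then again, normality tests are not all that useful: with small samples they lack power, and with large samples the t-test and ANOVA are quite robust even when the data are not normal.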

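And if the assumptions of a parametric test are not met, that does not mean you must use a nonparametric test. Speaking irresponsibly, other approaches may work even better.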

2. R (or I guess S). R is a cranky, odd statistical language/system with
a great scientific plotting package. It’s a package written mainly by
statisticians for statisticians, and is rather unforgiving the first
time you use it. It is definitely worth persevering. It’s basically a
combination of Excel spreadsheets on steroids (with no data entry. an
R data frame is really the same logical set as an Excel workbook - able to
handle millions of points, not 1,000s), a statistical methods compendium
(it’s usually the case that statistical methods are written first in R,
and you can almost guarantee that there are no bugs in the major
functions - unlike many other scenarios) and a graphical data
exploration tool (in particular lattice and ggplot packages). The syntax
is inconsistent, the documentation sometimes wonderful, often awful and
the learning curve is like the face of the Eiger. But once you’ve met
p.adjust(), xyplot() and apply(), you can never turn back.

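R really is wonderfully useful. Once I got used to vectorized operations, I rarely touched Perl again. Among the biologists I have met, though, few can use Excel well (I cannot either -,-), very few can use SPSS, and I have never seen anyone use SAS. Whenever I tell people around me that I use R, almost nobody has heard of it. In China it is currently used mainly in universities. But anyone doing bioinformatics, at the very least, needs to learn R; with that huge collection of packages built on it, there is no way around it.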

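People in biology tend to like software with a graphical interface, such as SPSS. But clicking through menus is only a way of choosing parameters, and in R you type them on the keyboard instead; it amounts to the same thing. A programming language beats a point-and-click analysis package: the work can be automated, and it is easier to communicate (reading the code tells you exactly what was done). Packages like SPSS wrap many analyses into modules, so a regression model is a few clicks away. That is nice, but few problems in modern data analysis can be solved with a click of the mouse.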

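As for plotting, the post mentions lattice and ggplot. lattice is probably the most complex graphics package in R at the moment, said to be far more capable than ggplot and faster at drawing, though I have never used it. I only learned ggplot, because its syntax is more human friendly. I think anyone who learns ggplot ends up loving plotting =,=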

3. The problem of multiple testing, and how to handle it, either with
the Expected value, or FDR, and the backstop of many a piece of
bioinformatics - large scale permutation. Large scale permutation is
sometimes frowned upon by more maths/distribution purists but often is
the only way to get a sensible sense of whether something is likely “by
chance” (whatever the latter phrase means - it’s a very open question)
given the complex, heterogeneous data we have. 10 years ago perhaps the
lack of large scale compute resources meant this option was less open to
people, but these days basically everyone should be working out how to
appropriately permute the data to allow a good estimate of
“surprisingness” of an observation.
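The permutation idea in the passage above can be sketched as follows. This is only an illustration under my own assumptions: the function name `permutation_test`, the choice of the absolute difference in group means as the statistic, the 2000 shuffles, and the toy data are all hypothetical, and real data usually need more thought about which labels are actually exchangeable.

```python
import random

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in group means.

    Group labels are shuffled n_perm times; the p-value is the fraction of
    shuffles whose statistic is at least as large as the observed one
    (with +1 in numerator and denominator so it is never exactly zero).
    """
    rng = random.Random(seed)

    def mean_diff(g1, g2):
        return abs(sum(g1) / len(g1) - sum(g2) / len(g2))

    observed = mean_diff(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance guards against floating-point ties with `observed`.
        if mean_diff(pooled[:len(a)], pooled[len(a):]) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data (hypothetical): one clearly shifted pair, one interleaved pair.
p_shift = permutation_test([1, 2, 3, 4, 5, 6, 7, 8],
                           [11, 12, 13, 14, 15, 16, 17, 18])
p_null = permutation_test([1, 3, 5, 7], [2, 4, 6, 8])
print(f"shifted groups: p = {p_shift:.4f}; interleaved groups: p = {p_null:.4f}")
```

No distributional assumption is made; the "by chance" reference distribution comes from the shuffled data themselves.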

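High-throughput omics data are becoming more and more common. A p-value cutoff controls the probability of a Type I error for a single test. Omics data have many measured features, few replicates, and a lot of noise, so if you simply apply a p-value cutoff, the higher the throughput, the more Type I errors you accumulate, and false positives go uncontrolled. This matters more and more. When I taught a class this week I made a point of covering the Bonferroni method, the Benjamini-Hochberg method, and the q-value, though the students did not seem very interested. Perhaps one day, when they are writing a paper and a reviewer asks for the FDR, they will remember.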
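For readers who want to see what the two corrections do, here is a minimal sketch in plain Python. In R the same adjustments come from `p.adjust()` with `method = "bonferroni"` or `method = "BH"`; the function names and the five example p-values below are my own assumptions for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    For the p-value of rank k (1 = smallest) the raw adjustment is p * m / k;
    walking from the largest rank down and keeping a running minimum makes
    the adjusted values monotone in the original p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values from five tests.
pvals = [0.03, 0.001, 0.6, 0.012, 0.02]
p_bonf = bonferroni(pvals)
p_bh = benjamini_hochberg(pvals)
n_bonf = sum(p < 0.05 for p in p_bonf)
n_bh = sum(p < 0.05 for p in p_bh)
print("Bonferroni:", [round(p, 4) for p in p_bonf], "->", n_bonf, "hit(s)")
print("BH:        ", [round(p, 4) for p in p_bh], "->", n_bh, "hit(s)")
```

Bonferroni controls the chance of even one false positive and is therefore conservative; Benjamini-Hochberg controls the expected proportion of false positives among the reported hits, which is usually the more useful target for omics-scale screens.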

4. The relationship between Pvalue, Effect size, and Sample size. This
needs to be drilled into everyone - we’re far too trigger happy quoting
Pvalues, when we should often be quoting Pvalues and Effect size. Once a
Pvalue is significant, its higher significance is sort of meaningless
(or rather it compounds Effect size things with Sample size things, the
latter often being about relative frequency). So - if something is
significantly correlated/different, then you want to know about how much
of an effect this observation has. This is not just about GWAS like
statistics - in genomic biology we’re all too happy about quoting some
small Pvalue not realising that with a million or so points often, even
very small deviations will be significant. Quote your r2, Rhos or
proportion of variance explained…

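I have never worked with GWAS and do not know how it is computed. Judging from the description, this point is about power analysis, which is useful for experimental design because it lets you estimate the sample size. If the sample size is already fixed, then setting the significance level (the p-value cutoff) and the power lets you compute the effect size, that is, how large an effect the experiment can detect. Or, knowing the sample size, effect size, and significance level, you can compute the power, that is, the probability of detecting the effect if it exists. This is not the same as the p-value, which is the probability of seeing data at least as extreme as those observed if there were no effect.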

Power analysis comes down to four quantities that you juggle back and forth: know any three and you can compute the fourth.

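The larger the difference, the easier it is to detect; the larger the variance between samples, the larger the sample size needed to rule out the noise that sample-to-sample variation introduces. That is roughly all there is to it.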
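The "know three, compute the fourth" relationship can be sketched with a normal approximation for comparing two group means. This is a rough sketch under stated assumptions: the effect size is Cohen's d (difference in means divided by the standard deviation), both groups have the same size, the test is two-sided, and the function names are hypothetical. R's `power.t.test()` uses the t distribution instead and gives slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    """Samples per group needed to detect `effect_size` (Cohen's d)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

def achieved_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power; the negligible opposite-tail term is ignored."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(effect_size * sqrt(n_per_group / 2) - z_alpha)

def detectable_effect(n_per_group, alpha=0.05, power=0.8):
    """Smallest Cohen's d detectable with the given sample size and power."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_power = Z.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_group)

n_needed = sample_size_per_group(0.5)
print("n per group for d = 0.5:", n_needed)
print("power with that n:", round(achieved_power(n_needed, 0.5), 3))
print("detectable d with that n:", round(detectable_effect(n_needed), 3))
```

Because d divides the mean difference by the standard deviation, a larger between-sample variance shrinks d for the same raw difference, and the required sample size grows with 1/d².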

5. Linear models and PCA. There is a tendency often to jump to quite
complex models - networks, or biologically inspired combinations, when
our first instinct should be to crack out the well established lm()
(linear model) for prediction and princomp() (PCA) for dimensionality
reduction. These are old school techniques - and often if you want to
talk about statistical fits one needs to make gaussian assumptions about
distributions - but most of the things we do could be either done well
in a linear model, and most of the correlation we look at could have
been found with a PCA biplot. The fact that these are 1970s bits of
statistics doesn’t mean they don’t work well.

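PCA maps a high-dimensional space onto a lower-dimensional one, reducing the dimensionality while keeping as much information as possible. The passage below explains how a linear model and PCA differ: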

One may also see PCA as an analogue of the least squares method to
find a line that goes as “near” the points as possible – to simplify,
let us assume there are just two dimensions. But while the least
squares method is asymmetric (the two variables play different roles:
they are not interchangeable, we try to predict one from the others,
we measure the distance parallel to one coordinate axis), the PCA is
symmetric (the distance is measured orthogonally to the line we are
looking for).
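
The asymmetry described in the passage above can be checked numerically in two dimensions. The sketch below is an illustration under my own assumptions (hypothetical function names and toy data): it uses the closed-form slope for least squares and the closed-form direction of the first principal axis of a 2x2 covariance matrix, rather than calling `lm()` or `princomp()`.

```python
from math import atan2, tan

def centered_sums(x, y):
    """Sums of squares and cross-products about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxx, syy, sxy

def ols_slope(x, y):
    """Least squares slope for predicting y from x (vertical distances)."""
    sxx, _, sxy = centered_sums(x, y)
    return sxy / sxx

def pca_slope(x, y):
    """Slope of the first principal axis (orthogonal distances).

    For a 2x2 covariance matrix the major axis makes an angle theta with
    the x axis, where tan(2 * theta) = 2 * sxy / (sxx - syy).
    """
    sxx, syy, sxy = centered_sums(x, y)
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    return tan(theta)

# Toy data (hypothetical).
x = [1, 2, 3, 4, 5]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

b_yx = ols_slope(x, y)      # regress y on x
b_xy = ols_slope(y, x)      # regress x on y
b_pca = pca_slope(x, y)
print(f"y on x: {b_yx:.4f}; x on y, drawn in the same plane: {1 / b_xy:.4f}")
print(f"PCA axis: {b_pca:.4f}; with x and y swapped: {pca_slope(y, x):.4f}")
```

The two regressions give two different lines in the same plane, whereas swapping the variables in PCA gives the same line (the slope simply becomes its reciprocal).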

In the comments, John Mark wrote about what to learn next in order to go further, so I am recording that here as well.

The next level - number 6 - would be to get beyond P values, and
instead compute probability distributions of the quantities of
interest. This leads naturally to number 7, which is to delve into the
generative models that are currently solved by MCMC methods. This is
basically the Bayesian approach. Just as an aside “non parametrics” in
some new work is also used to mean models where the number of
parameters varies, as a consequence of the method.


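PS: The posts of mine that were copied did all cite the source, though of course they were straight copy-paste jobs. Some of them labeled the source as "compiled from", and that wording is a problem too. Over the past two days several people have also told me that their work was copied without any attribution, and I have found cases myself where copied posts were labeled as original. Only these three of my own posts were copied. Now that I have reposted them, I am done with that whole mess. It is time to write something new to share with everyone.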

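PS2: Posting every day takes a lot of time, and formatting for WeChat is time-consuming too. Nobody can read daily posts every day anyway, and I have noticed that I seem to be spending too much time on this, so I am going to hold back a little and switch to weekly posts. Thank you all for your support.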




