# A common, common, common mistake in rarefaction analysis

Beware! Whenever confidence bands around a rarefaction curve vanish at the largest sample size, it is virtually guaranteed that the confidence bands were constructed incorrectly. Here is one of many, many, many examples. This example claims to plot 95% confidence intervals, but in fact intervals constructed in this way will contain the true expected species richness much less than 95% of the time.

What’s going on? Let *n* be the sample size of the study. In such a study, the rarefaction curve ends at *n*. We observed only a single sample of size *n*, so we are very unsure about what the expected species richness is in a sample of size *n*. Therefore, the confidence intervals should be very wide at the largest sample size on the curve. Nevertheless, it is very common to see the narrowest confidence intervals at maximum sample sizes. Here’s another example of the mistake. I could go all day giving different examples. Here’s another one.
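To see how the bands collapse, here is a minimal Python sketch of the classical conditional calculation for individual-based rarefaction: the expected richness in a random subsample of *m* individuals drawn without replacement from the reference sample, plus the variance of that richness given the reference sample. The abundance vector and the function name `rarefy_conditional` are made up for illustration, and the 1.96 multiplier assumes a normal approximation. This is a sketch of the general idea, not the code of any particular package.

```python
from math import comb, sqrt

def rarefy_conditional(abundances, m):
    """Expected richness and conditional variance for a subsample of m individuals.

    The subsample is drawn without replacement from the reference sample,
    so everything here is conditional on the observed abundances.
    """
    n = sum(abundances)
    total = comb(n, m)
    # Probability that species i is missing from the subsample.
    absent = [comb(n - x, m) / total for x in abundances]
    expected = sum(1 - p for p in absent)
    # Variance of the number of missing species (same as variance of richness):
    # sum of indicator variances plus twice the pairwise covariances.
    var = sum(p * (1 - p) for p in absent)
    for i in range(len(abundances)):
        for j in range(i + 1, len(abundances)):
            both_absent = comb(n - abundances[i] - abundances[j], m) / total
            var += 2 * (both_absent - absent[i] * absent[j])
    return expected, var

# Hypothetical reference sample: 7 species, n = 90 individuals.
reference = [50, 20, 10, 5, 3, 1, 1]
n = sum(reference)

for m in (10, 30, 60, 80, n):
    expected, var = rarefy_conditional(reference, m)
    half_width = 1.96 * sqrt(max(var, 0.0))  # guard against tiny negative rounding
    print(f"m={m:3d}  expected richness={expected:5.2f}  95% half-width={half_width:5.2f}")
```

At *m* = *n* every species in the reference sample is certain to appear in the "subsample", so the conditional variance is exactly zero and the band closes, no matter how incomplete the reference sample is.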

Why do people make this mistake? I’d love to get into it but don’t have the time.

A key term in all of this is ‘unconditional variance’. The wrong rarefaction curves use ‘conditional variance’ to construct confidence envelopes. This word ‘conditional’ refers to the fact that the variance is constructed assuming that all of the species in the assemblage have been found — i.e. we *condition* our variance estimates on the number of species observed. But we know there are more species out there, otherwise we wouldn’t be doing rarefaction in the first place. Therefore, we want our variance estimators to be *unconditional*. If this seems confusing, it is. Neither have I explained it well, nor is it easy to understand even if explained well. To understand it, I recommend:

Colwell, Robert K., et al. “Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages.” Journal of Plant Ecology 5.1 (2012): 3-21.

Here’s a teaser from this paper:

…unconditional variance expressions assume that the reference sample represents a random draw from a larger (but unmeasured) community or species assemblage, so that confidence intervals for rarefaction curves remain ‘open’ at the full-sample end of the curve. In contrast, traditional variance estimators for rarefaction (e.g. Heck et al. 1975; Ugland et al. 2003) are conditional on the sample data, so that the confidence interval closes to zero at the full-sample end of the curve, making valid comparisons of curves and their confidence intervals inappropriate for inference about larger communities or species assemblages.
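The "random draw from a larger community" idea can be checked with a small simulation. The Python sketch below invents a community of 40 species with geometrically declining relative abundances (purely an assumption for illustration), draws many independent reference samples of *n* individuals from it, and records the observed richness of each. It does not implement the analytical unconditional variance estimators from Colwell et al.; it only shows the sampling variability those estimators are meant to capture.

```python
import random
from statistics import mean, pvariance

def simulate_observed_richness(rel_abund, n, reps, seed=1):
    """Observed richness in `reps` independent samples of n individuals each.

    Individuals are drawn with replacement from the community, i.e. the
    community is treated as effectively infinite relative to the sample.
    """
    rng = random.Random(seed)
    species = list(range(len(rel_abund)))
    richness = []
    for _ in range(reps):
        sample = rng.choices(species, weights=rel_abund, k=n)
        richness.append(len(set(sample)))
    return richness

# Hypothetical community: 40 species, geometric abundance decay.
community = [0.8 ** i for i in range(40)]

richness = simulate_observed_richness(community, n=100, reps=2000)
print("mean observed richness:", round(mean(richness), 2))
print("variance across samples:", round(pvariance(richness), 2))
print("range:", min(richness), "to", max(richness))
```

Different reference samples of the same size *n* find different numbers of species, so the variance of richness at *n* is greater than zero. A conditional band that reports zero width at *n* ignores exactly this variation.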

Maybe this is common because the usual statistical software just provides this type of confidence band…? Including {vegan}? Time to hack the sources! (Much easier said than done, sadly.)

You are probably right. I wish I had the time!

Can I ask you which non-R program you like best for plotting rarefaction curves? I know my Q doesn’t really have anything to do with what you wrote about (CI around rarefaction curves)…just curious. Tnx

I pretty much only use R these days. Sorry.

Late update to this. Chao has apparently been developing a package in R to give you “correct” confidence intervals for extrapolation curves. http://chao.stat.nthu.edu.tw/blog/software-download/inext-r-package/