Warning: family()$aic is not what you think!
Or at least not what I thought. Some background: R has this thing called family objects. They are meant to provide model fitting functions (e.g. glm, glmer) information about the assumed error distribution (e.g. binomial, Poisson, Gaussian). They have several components that return functions or expressions for computing various values that depend on the chosen family. One of those functions is for computing AIC, and you get it, for the binomial family say, by doing this:
> binomial()$aic
function (y, n, mu, wt, dev)
{
m 1))
n
else wt
-2 * sum(ifelse(m > 0, (wt/m), 0) * dbinom(round(m * y),
round(m), mu, log = TRUE))
}
OK, so what’s my problem? Well, if you look carefully at this function…you’ll find that it just returns minus twice the log-likelihood. Well minus twice the log-likelihood is not AIC, its deviance. The AIC is the deviance + twice the number of parameters. To confirm my suspicion, here’s some code from glm:
aic <- family$aic
So aic contains the AIC function for the particular family being assumed. OK, a bit later the AIC of the model is calculated:
## calculate AIC
aic.model <-
aic(y, n, mu, weights, dev) + 2*rank
Note how they have to add the 2*rank bit (rank is just a fancy word for number of parameters…basically). So the AIC function from family objects don’t return AIC! OK this is very confusing to me. There’s probably a good reason for this, but I don’t know…its a little fishy.
Just to avoid people further confusion, the AIC function does actually compute AIC, thankfully. And this is the function that most non-power-users will use. But as a power-user, I’m frustrated that family()$aic doesn’t return a function that returns AIC!
Trying to remember the rule that if you rate poorly the code of someone smarter than you, you’re probably missing something. I do believe this, its just frustrating that family()$aic doesn’t return a function that returns AIC!
Difference between now and one, two, three, etc. commits ago
Its been a while since I’ve made a purely technical post. Here’s one that I just don’t want to forget.
Often I want to see how a file is different from say three git commits ago. This stackoverflow gives me exactly want I want. To summarize, here’s how you can get these diffs from a terminal for one, two, three, and four commits ago:
git diff HEAD~1 file.r
git diff HEAD~2 file.r
git diff HEAD~3 file.r
git diff HEAD~4 file.r
In a fascinating article on measurement in biology, David Houle and others begin with this,
PROGRESS IN science often involves quantification. Ideas progress from loose verbal accounts to become rigorous mathematical models. At the same time, concepts and entities progress from incomplete verbal definitions to become variables and parameters that derive their meaning from a precise theoretical context. For example, ecology developed from an unconnected set of verbal ideas in 1920 to a unified science with a rigorous foundation in mathematical population ecology by about 1970 (Kingsland 1985), and similar changes took place during the modern synthesis in evolutionary biology. Arguably, empirical progress results mainly from better measurement. Measurements improve either because of more rigorous theory, which defines what is important to measure, or better instruments, which allow more accurate measurements of familiar quantities or make the previously unknown measurable. The high status of quantification in science is therefore no surprise, nor is the desire of researchers to support their work with quantitative measurements.
Did you catch that? They put ecology *before* evolution: `rigorous foundation in mathematical population ecology’, and then…oh ya…then there’s this evolutionary biology thing…modern synthesis blah blah. That definitely doesn’t fit with the usual assumption that ecology lags behind evolution in its theoretical content. And did he say ‘unified’ and ‘ecology’ in the same sentence? I wonder if this paragraph structure was intentional, or if it could have just as easily started with the modern synthesis and then briefly mentioned ecological theory as a second example?
I don’t use R’s density function anymore
Here is an example of why:
Simulate some data from a mixture of binomial distributions:
set.seed(1)
n <- 100 # sample size
ub <- 1000 # upper bound
p1 <- 0.8 # probability
p2 <- 0.99 # probability
X <- rbinom(n, ub, sample(c(p1, p2), size = 100, replace = TRUE))
These data come from this density function:
xlim <- ub * c(0.7, 1.05)
xseq <- seq(xlim[1], xlim[2], 1)
den_tr <- 0.5 * dbinom(xseq, ub, p1) + 0.5 * dbinom(xseq, ub, p2)
par(cex.lab = 2, cex.main = 2.2, cex.axis = 1.8, mar = c(5, 6, 2, 1))
plot(xseq, den_tr, type = "l", main = "True density", ylab = "", xlab = "X")
abline(v = ub)
The vertical line gives the maximum possible value for the random variable.
Now try estimating the density using three methods
den_hs <- hist(X, plot = FALSE)
den_st <- density(X)
library(logspline)
den_ls <- logspline(X, lbound = 0, ubound = ub)
## Warning: too much data close together
## possible infinite density at upper end
## running program with fewer knots
## Warning: re-ran with oldlogspline
par(cex.lab = 2, cex.main = 2, cex.axis = 1.8, mar = c(5, 6, 2, 1))
plot(den_ls, xlim = xlim, main = "logspline estimate", ylab = "", xlab = "X")
abline(v = ub)
In this example, the logspline function does much better than the default settings for density or hist functions.
I'm not saying that logspline will always be best, but it does have some pretty sweet features. In particular, you can set upper and lower bounds on the random variable (zero and 1000 in this case). Also, logspline provides nice info about the fit of the model (because that’s what density estimates are…they’re models),
den_ls
## knots A(1)/D(2) loglik AIC minimum penalty maximum penalty
## 3 2 -568.1 1145.5 270.20 Inf
## 4 2 -433.0 879.9 4.44 270.20
## 5 2 -430.8 880.1 1.94 4.44
## 6 2 -429.9 882.7 1.61 1.94
## 7 2 -429.1 885.7 0.01 1.61
## 8 1 -429.0 890.3 0.00 0.01
## the present optimal number of knots is 4
## penalty(AIC) was the default: BIC=log(samplesize): log( 100 )= 4.61
See here for more info.
Off topic post: I’ve been arguing (over drinks) for a few years now that there’s no evidence for free will, and I get a lot of push back from my scientist and non-scientist friends. I know there’s an extremely large philosophical literature on this topic, but I don’t have time to read it. So my ideas in this area are mostly based on intuition. Still, I was pleased today to see Jerry Coyne apparently agreeing with me:
I am confident, as is Dennett, that one day consciousness will be explained by reductive physical analysis. It’s a hard problem, but many problems once thought insuperable have yielded to scientific study. For many years everyone believed in dualistic free will, but now we’re beginning to understand that our sense of “agency” is bogus, and that in fact our “free” choices may be largely or completely predicted from our genes and our environments, i.e., physical phenomena. Our feeling of agency is, of course, one aspect of our consciousness. This doesn’t explain the source of that feeling, but it’s entirely possible that agency is a confabulation installed in us by evolution to help make sense post facto of what we do, or relate our experiences to others.
How to choose colours for scientific graphs
My supervisor Ben Bolker just pointed me to an absolutely amazing website for choosing colours for scientific graphs:
It is conservative to assess significance by the degree of overlap of confidence intervals
Here Here is a really interesting paper with a very simple message:
1. When two confidence intervals do not overlap, then the two parameters are significantly different.
2. When two confidence intervals overlap, then the two parameters may or may not be significantly different.
In other words, when there is no overlap, you don’t really need to test for significance. But when there is overlap, you do need to test for significance.
Hope I’m not oversimplifying.
A common, common, common mistake in rarefaction analysis
Beware! Whenever confidence bands around a rarefaction curve vanish at the largest sample size, it is virtually guaranteed that the rarefaction curve was constructed incorrectly. Here is one of many many many examples. This example claims to plot 95% confidence intervals, but in fact intervals constructed in this way will contain the true expected species richness much less than 95% of the time.
What’s going on? Let be the sample size of the study. In such a study, the rarefaction curve ends at
. Therefore, we only observed a single sample of size
, so we are very unsure about what the expected species richness is in a sample of size
. Therefore, the confidence intervals should be very wide at the largest sample size on the curve. Nevertheless, it is very common to see the narrowest confidence intervals at maximum sample sizes. Here‘s another example of the mistake. I could go all day giving different examples. Here‘s another one.
Why do people make this mistake? I’d love to get into it but don’t have the time.
A key term in all of this is ‘unconditional variance’. The wrong rarefaction curves use ‘conditional variance’ to construct confidence envelopes. This word ‘conditional’ refers to the fact that the variance is constructed assuming that all of the species in the assemblage have been found — i.e. we condition our variance estimates on the number of species observed. But we know there are more species out there, otherwise we wouldn’t be doing rarefaction in the first place. Therefore, we want our variance estimators to be unconditional. If this seems confusing, it is. Neither have I explained it well, nor is it easy to understand even if explained well. To understand it, I recommend:
Colwell, Robert K., et al. “Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages.” Journal of Plant Ecology 5.1 (2012): 3-21.
Here’s a teaser from this paper:
…unconditional variance expressions assume that the reference sample represents a random draw from a larger (but unmeasured) community or species assemblage, so that confidence intervals for rarefaction curves remain ‘open’ at the full-sample end of the curve. In contrast, traditional variance estimators for rarefaction (e.g. Heck et al. 1975; Ugland et al. 2003) are conditional on the sample data,so that the confidence interval closes to zero at the full-sample end of the curve, making valid comparisons of curves and their confidence intervals inappropriate for inference about larger communities or species assemblages.
Freshwater ecology database!
This looks awesome!
http://edild.github.com/blog/2012/12/25/taxize-freshwaterecology/
New web form for submitting CRAN packages
http://cran.r-project.org/submit.html
I like this idea. The old email approach always seemed a little too informal.
Apparently its in beta testing.



