This post is about a really interesting new paper, but I’m going to start by providing a bit of background.

In theory, p-values and confidence intervals are exactly what we need to keep us honest in science. We want scientists to make erroneous claims with confidence only rarely, and p-values and confidence intervals are designed to make this happen. However, they are designed in such a way that assumes we are all ‘behaving’, ‘playing fair’, and ‘following the rules’. IF a particular type of study is conducted in accordance with the assumptions of a particular statistical test, THEN the associated p-values will only let us make erroneous claims with confidence 5% of the time. This is exactly what we want as scientists, I think.

BUT…scientists don’t play by the rules. For example, we’ll look at the data to decide what predictors to include in our statistical model, and then use that statistical model to make inferences about the data. This would all be fine if this kind of snooping was accounted for by the theory of the particular statistical test being used, but this usually isn’t the case. Usually, we informally try to find the most interesting relationships suggested by the data, and then test those. Well OBVIOUSLY the tests are going to look better than they really are, and you’re going to start making erroneous claims with confidence more than 5% of the time. This is what is called model selection bias: the bias that arises when we decide on our statistical test using information in the data themselves. It can be a psychological thing, but it can also be built into standard methods of inference (e.g. AIC, F-tests of homogeneity before an ANOVA, etc.). In short: model selection bias is a huge and widespread problem.
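To make this concrete, here is a minimal simulation sketch of my own (it is not taken from the paper). The setup is an assumption chosen for simplicity: the response and ten candidate predictors are all pure noise, so any ‘significant’ result is an erroneous claim. The ‘honest’ analyst tests a predictor chosen before seeing the data, while the ‘snooping’ analyst tests whichever predictor looks best. The names `t_statistic` and `false_positive_rates` are just illustrative helpers.

```python
import math
import random

N = 30               # observations per simulated study (assumed)
T_CRIT_DF28 = 2.048  # two-sided 5% critical value of Student's t with N - 2 = 28 df


def t_statistic(x, y):
    """t statistic for the slope in a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    return r * math.sqrt((n - 2) / (1 - r * r))


def false_positive_rates(n_sims=2000, n_predictors=10, seed=1):
    """Return (honest, snooped) rejection rates when nothing is really going on."""
    rng = random.Random(seed)
    honest = 0
    snooped = 0
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(N)]
        predictors = [[rng.gauss(0, 1) for _ in range(N)]
                      for _ in range(n_predictors)]
        abs_t = [abs(t_statistic(x, y)) for x in predictors]
        # Honest: the predictor to test was fixed before looking at the data.
        honest += int(abs_t[0] > T_CRIT_DF28)
        # Snooping: test whichever predictor looks most interesting.
        snooped += int(max(abs_t) > T_CRIT_DF28)
    return honest / n_sims, snooped / n_sims


honest_rate, snooped_rate = false_positive_rates()
print("honest:", honest_rate, "snooped:", snooped_rate)
```

Under these assumptions the honest rate should sit near the nominal 5%, while the snooped rate should be several times larger, even though each individual test is perfectly standard.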

One of the problems with model selection bias is that it’s really hard to quantify how big of a problem it really is. The data required to assess the magnitude of the problem are large numbers of p-values and confidence intervals, all calculated while studying the same phenomenon, which is now very well understood but wasn’t well understood when all those confidence intervals and p-values were being calculated. With data like this we can go and actually check the frequency with which scientists made erroneous claims with confidence, because we would now know which claims were erroneous. No such data set exists in ecology. I hope I’m wrong about this, but I know I’m not. However, in physics such data do exist, and people have actually used them to measure the rate at which physicists actually make erroneous claims with confidence! Amazing eh? In physics, there are things called ‘fundamental constants’. I know I know, it’s a strange concept for us ecologists to understand, but in physics they actually exist. Examples include the speed of light and the fine structure constant (not that I really understand much about these things at all, except that they exist). People have been trying to measure these things for a while, and for many of them we are so good at it now that we actually know their values up to very large numbers of significant figures. Over the years people put 98% confidence intervals on these constants, and typically 20-40% of these intervals failed to contain what we now know is the true value (note that 20-40% is much greater than the theoretical 2%). And I don’t know about you, but I suspect the problem may be worse in ecology (to say the least).
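The kind of check described here is simple once the truth is known: count how often the old intervals miss it. Below is a rough sketch with simulated intervals, since I don’t have the physics data. Everything in it is an assumption for illustration: the true value of 10.0, the normal errors, and especially the ‘overconfident’ case, where the reported standard deviation is half the real one. That halving is just a stand-in for whatever makes intervals too narrow.

```python
import random

Z_98 = 2.326        # two-sided 98% critical value of the standard normal
TRUE_VALUE = 10.0   # hypothetical 'now known' value of the constant


def miss_rate(intervals, true_value):
    """Fraction of (lower, upper) intervals that fail to contain true_value."""
    misses = sum(1 for lower, upper in intervals
                 if not (lower <= true_value <= upper))
    return misses / len(intervals)


def simulate_intervals(true_sd, reported_sd, n_studies, rng):
    """Simulate nominal 98% intervals built from a possibly wrong reported_sd."""
    intervals = []
    for _ in range(n_studies):
        estimate = rng.gauss(TRUE_VALUE, true_sd)
        half_width = Z_98 * reported_sd
        intervals.append((estimate - half_width, estimate + half_width))
    return intervals


rng = random.Random(0)
# Intervals that use the correct uncertainty.
honest = simulate_intervals(true_sd=1.0, reported_sd=1.0, n_studies=5000, rng=rng)
# Intervals that understate the uncertainty by a factor of two (assumed).
overconfident = simulate_intervals(true_sd=1.0, reported_sd=0.5, n_studies=5000, rng=rng)

print("honest miss rate:", miss_rate(honest, TRUE_VALUE))
print("overconfident miss rate:", miss_rate(overconfident, TRUE_VALUE))
```

With the correct uncertainty the miss rate should land near the nominal 2%. With the understated one it should land in the same ballpark as the 20-40% reported for the physics intervals, which is only meant to show how little overconfidence it takes, not to explain what actually happened in physics.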

But new research in theoretical statistics may have a ‘magical’ solution to this problem. OK, I’m not going to claim that I completely understood this paper. When they started talking about things like polytopes and Scheffé balls, I knew that the level of mathematics was beyond me. But the main claim is easy to understand, and extremely exciting. They claim that they have a new correction for confidence intervals (it makes them wider) and p-values (it makes them larger) that controls for (and this is the amazing part) ALL forms of model selection bias. So, as I understand their claim, if you went in there after an experiment and said something like, ‘oh whoops…looks like variable X actually shouldn’t have been measured after all,’ on the grounds that it didn’t make sense based on some combination of natural history and data snooping, ‘so we’ll just forget we ever measured it,’ THEN: you would still get correct p-values and confidence intervals on the parameters that were retained in the model IF: you use their method! That’s pretty amazing to me, and could be pretty game changing.

As with all new theoretical ideas, I’m a bit skeptical. There are a number of caveats in there, some of which I don’t understand as a result of my lack of mathematical knowledge, that I think may limit the practical applicability of the approach. The one that stuck out most for me is that they assume that you have an ‘independent’ measure of the residual variance. I put quotes on ‘independent’, not because the authors don’t define this concept, but because I don’t fully understand their particular meaning of it yet. So you can’t do the usual thing of just calculating the variance of the observed residuals; you have to ‘know’ somehow what the residual variance should be. The way I’m putting this issue sounds insurmountable, but they have actually given this problem a lot of thought and may have some good practical proposals, which I don’t understand yet.

Caveats aside…I’m extremely encouraged that theoretical statisticians are now considering model selection bias in detail, and am really looking forward to new developments in this area.

1. November 13, 2012 3:36 pm

Sounds really intriguing, thanks for posting on it. I can see that the described situation might be quite important when one has little or no clue, a priori, about what the cause-and-effect relationships are in the set of variables one has measured (or otherwise has data for)–as for example in strict data mining or pattern detection exercises. That is, whenever you’re just trying to get a feel for whether there might be anything meaningful. But other than that, researchers often have a pretty good idea about what affects what, and so they tend to measure those things, and are more interested in an estimate of the effect size than in trying to convince themselves that there is in fact a relationship.

2. November 14, 2012 3:37 am

Thanks for the comment Jim. You’re right that this new methodology is most useful whenever you’re doing heavily exploratory analysis with liberal amounts of data snooping. However, as I pointed out with the particle physics example, even when we have clear targets for estimation, confidence intervals often don’t behave as advertised. My hope is that new methods like this will help us to generate more accurate confidence intervals and p-values even when we don’t ‘play by the rules’.

• November 14, 2012 5:55 pm

Yes Steve, you did indeed point that out. I guess I’m just not sure how some of these constants in physics are estimated–it’s a little hard for me to see how the principle applies in those cases. But it’s an interesting and important topic for sure. Sort of points up to me how easily (and/or frequently) we take certain things for granted in science. It’s not a topic I’d thought a lot about. In light of this, I wonder: how do we ever come to any sense of confidence in what the true value of any particular parameter is?

• November 17, 2012 1:34 pm

Sorry for the late reply…but I wonder this myself. Try our best?

In science we need to strike a balance between being over- and under-confident. It’s easy to not be proved wrong in the future…just make your interval estimates really wide. But this is being under-confident. In contrast, it’s also not good to make your intervals too short…that’s being over-confident (like the physicists in this example). I meant the physics example to illustrate that there is probably still work to do on achieving our desired level of confidence.

In ecology things are even harder, because we’re often in a situation that Tukey called uncomfortable science. Not long after we observe a sample from a statistical population, that population changes its characteristics (to some extent). Example: we estimate the number of elk in a government park, and then by next year some individuals have died, and some have been born. We will never come to KNOW the true number of elk during the year that we sampled, and therefore cannot check to see how good our confidence intervals are over a range of similar studies.

So we’re back to your question: “how do we ever come to any sense of confidence in what the true value of any particular parameter is?”