
What is science? I’d rather not say, thank you

June 22, 2013

This blog is mostly about remembering things I’ve read or figured out. For example, this post is just a collection of philosophy of science I’ve been reading/watching lately.

It seems important to be able to distinguish science from non-science (e.g. should creationism be taught in schools?). I like Susan Haack’s take on this, which is that it’s more important to distinguish serious and honest thinking from partisan and dishonest thinking. Rather than just saying that creation science is actually pseudoscience, Haack would rather people point out specifically what is wrong with creation science. She discusses Popper’s attempt to demarcate science from non-science by emphasizing falsification. Her take is that he’s right that honest self-criticism is the mark of good science, but it’s also the mark of any serious and honest thinking. There are plenty of scientists who don’t honestly self-criticize and plenty of non-scientists who do. I like her strategy of defending science, within reason. What is science? Haack would rather not say.

Speaking of Popper, I’ve recently checked out his ‘Myth of the Framework’. In it he criticizes claims that productive discussions are impossible unless the discussants all share the same framework (i.e. share a vocabulary/worldview). He claims, to the contrary, that productive discussions are more likely among people with different frameworks. This reminds me of all the high-fiving that can happen at plenary sessions at evolutionary biology conferences. I really like his point that discussions and debates aren’t for winning…they are for exposing the weaknesses of everyone’s positions. This makes a lot of sense to me. I always really appreciate generosity from those with whom I argue. I’m often criticized for being too passive in person, but I think this is a good thing. I’m not trying to win, I’m trying to figure things out. Not that I can’t slip into trying to win — I’m trying to make a living at research. In fact, in print I often tend to be so aggressive that reviewers rightly get annoyed with my unjustified tone of confidence. Note to anyone who reviews my work: trust me, I respect your position and am just trying to push the issue. Totally agree, Popper…discussions shouldn’t be about winning — not to say that certain positions can’t bore us sometimes or just be clearly wrong.

The myth of the framework makes me wonder about peer review. We often feel like we are unfairly judged because the reviewer doesn’t ‘get it’ (i.e. they have a different framework). I think Einstein often felt this way, right? The problem here, I think, is that peer review isn’t enough of a conversation. When talking across frameworks you need to be able to respond to questions and make clarifications. But there’s too much of an imbalance of power in peer review. On the other hand, if you get reviewers in your framework, and don’t make any really bad mistakes, then you’ll probably get a good review even if your work isn’t that interesting to others outside of your framework. Just some thoughts…I’ve got no solutions, as is usual with discussions of peer review.

Speaking of talking across disciplines, philosopher Nancy Cartwright has been doing some interesting work on randomized controlled trials (RCTs) in public health/medicine. Apparently (3:56) she has been having trouble with researchers outside of her field. Still, I really like her point that RCTs *can* be used to discover the causes of things, but that those causes are usually (always?) local. That is, we *can* discover that a particular treatment causes a particular benefit in a particular village, but stronger assumptions than the ones usually associated with RCTs are necessary for determining whether this treatment will also cause the same benefit in another village. It’s essentially the extrapolation problem, in terms of causation, for RCTs. It reminded me of a similar problem with information criteria pointed out by another philosopher.

I really liked Cartwright’s ‘How the Laws of Physics Lie’ when I read it a while ago. A central theme for Cartwright is that scientific knowledge is local. That is, even if there are ‘universal laws’ (a claim that Cartwright isn’t so sure about), we still need a good understanding of the particular place and time in which we try to apply such laws (e.g. sure, gravity is a general principle, but if you don’t consider air resistance you’ll miscalculate the rate at which a feather falls to the ground). This is especially true in sciences like economics, public health, and ecology. Cartwright argues that we not only need good general theories, but also good local theories. Local theories are things like an understanding of the culture of a particular village, or the pollution history of a particular ecological study site. This is why I’m generally a supporter of natural history, something I don’t do myself. Jeremy Fox, for example, agrees that natural history needs to be done, but prefers that it not be written up in our best journals. I don’t quite agree, but I understand where he’s coming from and still can’t quite pin down what rubs me the wrong way.

In any case, I think that creating good local theories is very important, maybe because that’s essentially what my own research concerns. Observational descriptive studies are important because they teach us about the kinds of states that ecosystems, populations, and communities actually find themselves in. I interpret Cartwright as justification for needing both rigorous causal studies (e.g. RCTs, ecological microcosms) and observational descriptive studies (well-parameterized empirical models). These aren’t novel ideas I’m having, but Cartwright brings new excitement to them for me. I like the way Cartwright refers to ‘local theories’ (an example of which I take to be observational statistical modeling) rather than the more pejorative ‘descriptive studies’.

Enough rambling, but I’m glad to get that off my chest.


Keeping prediction intervals positive

June 21, 2013

Often in ecology our response variables are strictly positive
(e.g. population density). As a result some standard default
procedures won't work well. For example, here are some data,

x <- rnorm(10)
y <- exp(-x^2 - 2 + rnorm(10))
df <- data.frame(x, y)

Here's a standard ggplot exploration,

library(ggplot2)
ggplot(df) + geom_point(aes(x, y)) + geom_smooth(aes(x, y), method = "loess")


But wait, that's not right, the prediction interval dips below zero!

The best way to deal with this problem is to use models that account
for bounded or censored responses (so-called left-censoring in this case). But if you just want to do some quick exploration, here's a
simple-minded solution that works well. Start by log-transforming the response (you might need log(y + 1) if there are zeros) and fitting a loess model to the transformed data,

m <- loess(log(y) ~ x)

Then use this hidden ggplot function for getting prediction
information about the model fit,

pdf <- ggplot2:::predictdf(m, seq(min(x), max(x), length = 100), TRUE, 0.95)

Then back-transform,

pdf <- within(pdf, {
    y <- exp(y)
    ymin <- exp(ymin)
    ymax <- exp(ymax)
})
and re-plot,

ggplot(pdf) + geom_path(aes(x, y), colour = "blue") + geom_ribbon(aes(x, ymin = ymin, 
    ymax = ymax), alpha = 0.2) + geom_point(aes(x, y), data = df)


That's better!
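Since exp() is strictly positive, the back-transformed ribbon can't dip below zero. Here's a quick sanity check of that idea using predict.loess directly, without the hidden ggplot function (a sketch; the 1.96 multiplier is my assumption, approximating a 95% band on the log scale):

```r
set.seed(1)  # my addition, for reproducibility
x <- rnorm(10)
y <- exp(-x^2 - 2 + rnorm(10))
m <- loess(log(y) ~ x)

# standard errors of the fit on the log scale
p <- predict(m, data.frame(x = seq(min(x), max(x), length = 100)), se = TRUE)

# back-transform the lower band: exp() keeps it strictly positive
ymin <- exp(p$fit - 1.96 * p$se.fit)
all(ymin > 0)  # TRUE
```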

Calculating the PRESS statistic in R

June 18, 2013

Engineers tend to use a version of the residual sum of squares (RSS) called PRESS, for predicted residual sum of squares. The idea is that RSS describes how well a linear model fits the data to which it was fitted, but PRESS tells you how well the model will predict new data.

Let's make up some data and fit a model,

n <- 10
x <- rnorm(n)
y <- rnorm(n)
(m <- lm(y ~ x))
## Call:
## lm(formula = y ~ x)
## Coefficients:
## (Intercept)            x  
##      -1.037       -0.551

Here are the residuals,

(r <- resid(m))
##        1        2        3        4        5        6        7        8 
## -0.51730 -0.19795  0.61276 -1.76696  1.45084  0.06266  1.01057 -0.29810 
##        9       10 
## -0.35962  0.00310

And here are the predictively adjusted residuals,

(pr <- resid(m)/(1 - lm.influence(m)$hat))
##         1         2         3         4         5         6         7 
## -0.742199 -0.220755  0.689584 -2.066720  1.614070  0.115833  1.231767 
##         8         9        10 
## -0.337332 -0.530903  0.003685

There is some theoretical magic that makes this equal to the cross-validated residuals. So the regular RSS is,

sum(r^2)
## [1] 7.153

and the PRESS is,

sum(pr^2)
## [1] 9.878

which is bigger because predicting is harder than fitting.
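That 'theoretical magic' can be checked numerically: dividing each residual by one minus its leverage reproduces explicit leave-one-out refits. A sketch (the set.seed call is my addition so the numbers are reproducible, though they won't match the ones above):

```r
set.seed(1)
n <- 10
x <- rnorm(n)
y <- rnorm(n)
m <- lm(y ~ x)

# shortcut: predictively adjusted residuals via leverages
pr <- resid(m) / (1 - lm.influence(m)$hat)

# explicit leave-one-out: refit without observation i, predict y[i]
loo <- sapply(seq_len(n), function(i) {
    mi <- lm(y ~ x, subset = -i)
    y[i] - predict(mi, newdata = data.frame(x = x[i]))
})

all.equal(unname(pr), unname(loo))  # TRUE
sum(pr^2)  # the PRESS statistic
```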

Package binaries for Mac on R-Forge are still for R 2.15.3

June 18, 2013

This could be frustrating. But it looks as though it will only be temporary. I wonder how temporary?

The Bayesian puppies are at it again

June 17, 2013


I’m neither a Bayes booster nor a Bayes naysayer. But the Bayesian puppy guy’s got a really cool post on robust analysis of a classic artificial data set in statistics.

Footnotes in latex tables

June 12, 2013

Apparently putting footnotes in tables is considered bad practice, and so LaTeX doesn’t let you do it. There are a number of workarounds, but here’s my favourite: just use the \begin{longtable}...\end{longtable} environment from the longtable package instead of \begin{tabular}...\end{tabular}. This has the added bonus of allowing you to create long tables that run over more than one page.
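Here’s a minimal sketch of what that looks like (the table contents are made up):

```latex
\documentclass{article}
\usepackage{longtable}
\begin{document}
\begin{longtable}{lr}
  \hline
  Site & Density\footnote{Individuals per m$^2$.} \\
  \hline
  \endhead
  A & 1.2 \\
  B & 3.4 \\
  \hline
\end{longtable}
\end{document}
```

Unlike in tabular, the \footnote here just works, and the footnote text lands at the bottom of the page.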

Probability and statistics lesson from kids in the hall

June 2, 2013

If you’re not a Canadian of a certain vintage this might not mean much to you. But at least it’s more on topic than the last post about disco.

Soul song pick of the week: Stand on the Word (Larry Levan mix)

May 31, 2013


Jerry Coyne’s been posting some amazing soul music recently, so I thought I’d take this opportunity to play one of my all-time favourites: Stand on the Word (Larry Levan mix) — Joubert Singers. I’ve always thought that the way the term ‘Soul Music’ is used is far too narrow — lotta music got soul. This one by Larry Levan and the Joubert Singers from 1982 is one example of soul music that isn’t Soul Music.

I thought it would be fun to play a blatantly religious tune, given that this is kind of a response to Jerry Coyne. I’m not a believer but I have to say that so much of the music I really love is anything but secular.

Another wonderful thing about this tune is its significance for the history of gay culture, and ‘unity culture’ (don’t know if that’s a thing) in general. Larry Levan used to play at a club in NYC called the Paradise Garage that played a very important role in giving gay people and other ‘deviants’ a place to feel welcome.

Musically, I love how it’s not overproduced at all and how the singing at the most emotional part of the song is really out of key (1:57). Singing on key takes training (boring!), singing with soul takes soul. Actually, I’m not sure you can pin down *the* most emotional part. Could be at 1:30 when the choir sings “that’s how the good lord works!” Could be at 1:48 for “we must not question the good lord…”. Oh, but 1:13 is so amazing too, when the bass line comes in for “that’s how he works, that’s how, the good lord, he works.” Oh and the tempo…what a tempo. Blah I love it!

If you have trouble feeling this one, just try to put yourself in the position of the people who were a part of this scene. Many of them were outsiders who only felt like they belonged when they were at the Garage.

Saturday Night Fever sucks, but disco is a powerful force — that’s a historical fact. Think ‘I Will Survive’ kinda stuff.

Here it is:

Warning: family()$aic is not what you think!

April 27, 2013

Or at least not what I thought. Some background: R has this thing called family objects. They are meant to provide model fitting functions (e.g. glm, glmer) information about the assumed error distribution (e.g. binomial, Poisson, Gaussian). They have several components that return functions or expressions for computing various values that depend on the chosen family. One of those functions is for computing AIC, and you get it, for the binomial family say, by doing this:

> binomial()$aic
function (y, n, mu, wt, dev) 
{
    m <- if (any(n > 1)) n else wt
    -2 * sum(ifelse(m > 0, (wt/m), 0) * dbinom(round(m * y), 
        round(m), mu, log = TRUE))
}

OK, so what’s my problem? Well, if you look carefully at this function…you’ll find that it just returns minus twice the log-likelihood. But minus twice the log-likelihood is not AIC, it’s the deviance. The AIC is the deviance plus twice the number of parameters. To confirm my suspicion, here’s some code from glm:

aic <- family$aic

So aic contains the AIC function for the particular family being assumed. OK, a bit later the AIC of the model is calculated:

## calculate AIC
    aic.model <-
	aic(y, n, mu, weights, dev) + 2*rank

Note how they have to add the 2*rank bit (rank is just a fancy word for the number of parameters…basically). So the aic function from family objects doesn’t return AIC! OK, this is very confusing to me. There’s probably a good reason for it, but I don’t know…it’s a little fishy.

Just to avoid further confusion: the AIC function does actually compute AIC, thankfully. And this is the function that most non-power-users will use. But as a power-user, I’m frustrated that family()$aic doesn’t return a function that returns AIC!
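You can see the discrepancy for yourself. A sketch (the arguments I pass to binomial()$aic mimic what glm.fit passes internally for 0/1 responses with unit weights):

```r
set.seed(1)
x <- rnorm(20)
y <- rbinom(20, 1, 0.5)
m <- glm(y ~ x, family = binomial)

# call the family's aic component the way glm.fit does:
# aic(y, n, mu, wt, dev)
val <- binomial()$aic(y, rep(1, 20), fitted(m), weights(m), deviance(m))

val                         # minus twice the log-likelihood...
as.numeric(-2 * logLik(m))  # ...same number
AIC(m)                      # val + 2 * 2 (intercept and slope)
```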

I’m trying to remember the rule that if you think poorly of code written by someone smarter than you, you’re probably missing something. I do believe this; it’s just frustrating all the same.

Difference between now and one, two, three, etc. commits ago

April 19, 2013

It’s been a while since I’ve made a purely technical post. Here’s one that I just don’t want to forget.

Often I want to see how a file differs from, say, three git commits ago. This Stack Overflow answer gives me exactly what I want. To summarize, here’s how you can get these diffs from a terminal for one, two, three, and four commits ago:
git diff HEAD~1 file
git diff HEAD~2 file
git diff HEAD~3 file
git diff HEAD~4 file