Skip to content

I don’t use R’s density function anymore

March 19, 2013

Here is an example of why:

Simulate some data from a mixture of binomial distributions:

set.seed(1)
n <- 100  # sample size
ub <- 1000  # upper bound
p1 <- 0.8  # probability
p2 <- 0.99  # probability
X <- rbinom(n, ub, sample(c(p1, p2), size = 100, replace = TRUE))

These data come from this density function:

xlim <- ub * c(0.7, 1.05)
xseq <- seq(xlim[1], xlim[2], 1)
den_tr <- 0.5 * dbinom(xseq, ub, p1) + 0.5 * dbinom(xseq, ub, p2)
par(cex.lab = 2, cex.main = 2.2, cex.axis = 1.8, mar = c(5, 6, 2, 1))
plot(xseq, den_tr, type = "l", main = "True density", ylab = "", xlab = "X")
abline(v = ub)

unnamed-chunk-54

The vertical line gives the maximum possible value for the random variable.

Now try estimating the density using three methods

den_hs <- hist(X, plot = FALSE)
den_st <- density(X)
library(logspline)
den_ls <- logspline(X, lbound = 0, ubound = ub)
## Warning: too much data close together
## possible infinite density at upper end
## running program with fewer knots
## Warning: re-ran with oldlogspline
par(cex.lab = 2, cex.main = 2, cex.axis = 1.8, mar = c(5, 6, 2, 1))
plot(den_ls, xlim = xlim, main = "logspline estimate", ylab = "", xlab = "X")
abline(v = ub)

unnamed-chunk-53

unnamed-chunk-52

unnamed-chunk-51

In this example, the logspline function does much better than the default settings for density or hist functions.

I'm not saying that logspline will always be best, but it does have some pretty sweet features. In particular, you can set upper and lower bounds on the random variable (zero and 1000 in this case). Also, logspline provides nice info about the fit of the model (because that’s what density estimates are…they’re models),

den_ls
##  knots A(1)/D(2) loglik    AIC minimum penalty maximum penalty
##      3         2 -568.1 1145.5          270.20             Inf
##      4         2 -433.0  879.9            4.44          270.20
##      5         2 -430.8  880.1            1.94            4.44
##      6         2 -429.9  882.7            1.61            1.94
##      7         2 -429.1  885.7            0.01            1.61
##      8         1 -429.0  890.3            0.00            0.01
## the present optimal number of knots is  4 
## penalty(AIC) was the default: BIC=log(samplesize): log( 100 )= 4.61

See here for more info.

Advertisements
3 Comments leave one →
  1. September 19, 2013 11:40 pm

    Is there a reason you don’t likewise constrain the density() function via from = 0, to = ub?

    • September 19, 2013 11:59 pm

      Good point. I should have done. It doesn’t change the story though…`density` still way overestimates the height of the leftmost peak…but it does at least cut things off at the correct upper bound. It looks like `from` and `to` simply truncate instead of using the bound information to construct a better estimate of the entire density, as does `logspline`.

      • September 20, 2013 3:05 am

        Yeah, it looks “better” but is really just truncating what it would have estimated without the constraints. Nice post by the way; this issue has bugged me but I hadn’t come across `logspline` before, so that is a something I will have to take a look at.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: