# I don’t use R’s density function anymore

March 19, 2013

## Here is an example of why:

Simulate some data from a mixture of binomial distributions:

```
set.seed(1)
n <- 100 # sample size
ub <- 1000 # upper bound
p1 <- 0.8 # probability
p2 <- 0.99 # probability
X <- rbinom(n, ub, sample(c(p1, p2), size = 100, replace = TRUE))
```

These data come from this density function:

```
xlim <- ub * c(0.7, 1.05)
xseq <- seq(xlim[1], xlim[2], 1)
den_tr <- 0.5 * dbinom(xseq, ub, p1) + 0.5 * dbinom(xseq, ub, p2)
```

```
par(cex.lab = 2, cex.main = 2.2, cex.axis = 1.8, mar = c(5, 6, 2, 1))
plot(xseq, den_tr, type = "l", main = "True density", ylab = "", xlab = "X")
abline(v = ub)
```

The vertical line gives the maximum possible value for the random variable.

Now try estimating the density using three methods

```
den_hs <- hist(X, plot = FALSE)
den_st <- density(X)
library(logspline)
den_ls <- logspline(X, lbound = 0, ubound = ub)
```

```
## Warning: too much data close together
```

```
## possible infinite density at upper end
## running program with fewer knots
```

```
## Warning: re-ran with oldlogspline
```

```
par(cex.lab = 2, cex.main = 2, cex.axis = 1.8, mar = c(5, 6, 2, 1))
plot(den_ls, xlim = xlim, main = "logspline estimate", ylab = "", xlab = "X")
abline(v = ub)
```

In this example, the `logspline`

function does much better than the default settings for `density`

or `hist`

functions.

I'm not saying that `logspline`

will always be best, but it does have some pretty sweet features. In particular, you can set upper and lower bounds on the random variable (zero and 1000 in this case). Also, `logspline`

provides nice info about the fit of the model (because that’s what density estimates are…they’re models),

```
den_ls
```

```
## knots A(1)/D(2) loglik AIC minimum penalty maximum penalty
## 3 2 -568.1 1145.5 270.20 Inf
## 4 2 -433.0 879.9 4.44 270.20
## 5 2 -430.8 880.1 1.94 4.44
## 6 2 -429.9 882.7 1.61 1.94
## 7 2 -429.1 885.7 0.01 1.61
## 8 1 -429.0 890.3 0.00 0.01
## the present optimal number of knots is 4
## penalty(AIC) was the default: BIC=log(samplesize): log( 100 )= 4.61
```

See here for more info.

Advertisements

3 Comments
leave one →

Is there a reason you don’t likewise constrain the

`density()`

function via`from = 0, to = ub`

?Good point. I should have done. It doesn’t change the story though…`density` still way overestimates the height of the leftmost peak…but it does at least cut things off at the correct upper bound. It looks like `from` and `to` simply truncate instead of using the bound information to construct a better estimate of the entire density, as does `logspline`.

Yeah, it looks “better” but is really just truncating what it would have estimated without the constraints. Nice post by the way; this issue has bugged me but I hadn’t come across `logspline` before, so that is a something I will have to take a look at.