
A simpler version of the R scale command

September 24, 2012

R is often great at anticipating what applied statisticians want to do. But sometimes I disagree with R's choices. A case in point is the scale command, which subtracts the mean from an input variable and divides by its standard deviation. R assumes that I want to do this separately for each column of a matrix:

set.seed(1)
df <- data.frame(x = rnorm(10, -5), y = rnorm(10, 5))
mat <- as.matrix(df)
scale(mat)
##             x       y
##  [1,] -0.9719  1.1808
##  [2,]  0.0659  0.1318
##  [3,] -1.2399 -0.8135
##  [4,]  1.8743 -2.3034
##  [5,]  0.2528  0.8191
##  [6,] -1.2205 -0.2747
##  [7,]  0.4551 -0.2478
##  [8,]  0.7765  0.6498
##  [9,]  0.5683  0.5352
## [10,] -0.5606  0.3226
## attr(,"scaled:center")
##      x      y 
## -4.868  5.249 
## attr(,"scaled:scale")
##      x      y 
## 0.7806 1.0695 

And the same thing happens if we pass the data frame df instead of the matrix,

scale(df)
##             x       y
##  [1,] -0.9719  1.1808
##  [2,]  0.0659  0.1318
##  [3,] -1.2399 -0.8135
##  [4,]  1.8743 -2.3034
##  [5,]  0.2528  0.8191
##  [6,] -1.2205 -0.2747
##  [7,]  0.4551 -0.2478
##  [8,]  0.7765  0.6498
##  [9,]  0.5683  0.5352
## [10,] -0.5606  0.3226
## attr(,"scaled:center")
##      x      y 
## -4.868  5.249 
## attr(,"scaled:scale")
##      x      y 
## 0.7806 1.0695 

But what I really want is for scale to treat the entire matrix as a single variable. So I wrote this function,

simple.scale <- function(x, center = TRUE, scale = TRUE, simplify = TRUE) {
    
    # simplify x into an array (or atomic vector or matrix)
    if (simplify) 
        x <- simplify2array(x)
    
    # prelim: calculate mean and squared deviations
    mean.x <- mean(x, na.rm = TRUE)
    sqrdev.x <- (x - mean.x)^2
    
    # center and scale
    if (center) 
        x <- x - mean.x
    if (scale) 
        x <- x/sqrt(mean(sqrdev.x, na.rm = TRUE))
    
    return(x)
}

And this function gives the desired result,

simple.scale(mat)
##             x      y
##  [1,] -1.1327 1.2308
##  [2,] -0.9749 1.0124
##  [3,] -1.1734 0.8155
##  [4,] -0.7000 0.5052
##  [5,] -0.9465 1.1555
##  [6,] -1.1704 0.9277
##  [7,] -0.9158 0.9333
##  [8,] -0.8669 1.1203
##  [9,] -0.8986 1.0964
## [10,] -1.0701 1.0521
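
As a quick sanity check, here is a minimal sketch that repeats the whole-matrix arithmetic inline, without calling simple.scale, and compares it with the column-wise behaviour of scale. The names grand.mean, pop.sd and z are just illustrative choices and do not appear in the function above.

```r
# Rebuild the example matrix from the post
set.seed(1)
mat <- as.matrix(data.frame(x = rnorm(10, -5), y = rnorm(10, 5)))

# Whole-matrix standardisation: one mean and one (population) SD for all cells
grand.mean <- mean(mat)
pop.sd <- sqrt(mean((mat - grand.mean)^2))
z <- (mat - grand.mean) / pop.sd

mean(z)               # approximately 0 over the whole matrix
sqrt(mean(z^2))       # approximately 1 over the whole matrix
colMeans(z)           # x negative, y positive: the gap between columns survives
colMeans(scale(mat))  # both approximately 0: column-wise scaling removes the gap
```

The column means are the easiest way to see the difference between the two approaches: scale forces each of them to zero, while the whole-matrix version only forces their overall average to zero.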

Note that the large mean difference between x and y is maintained in the scaling (cf. the behaviour of scale above). Note also that I use the population (rather than the sample) standard deviation, which is just a preference of mine.
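
For readers who prefer the sample standard deviation, the sketch below shows how the two versions relate and one possible way to get a sample-SD variant from base R. The names n, sample.sd, pop.sd and z.sample are hypothetical, and this is an alternative sketch rather than part of simple.scale.

```r
# Rebuild the example matrix from the post
set.seed(1)
mat <- as.matrix(data.frame(x = rnorm(10, -5), y = rnorm(10, 5)))

n <- length(mat)                           # 20 values in total
sample.sd <- sd(as.vector(mat))            # divides by n - 1
pop.sd <- sqrt(mean((mat - mean(mat))^2))  # divides by n

# The two differ only by a constant factor of sqrt((n - 1) / n)
all.equal(pop.sd, sample.sd * sqrt((n - 1) / n))

# Sample-SD variant: flatten, let scale() standardise the single column,
# then restore the original shape and dimnames
z.sample <- array(as.vector(scale(as.vector(mat))),
                  dim = dim(mat), dimnames = dimnames(mat))
```

With only 20 values the two scalings differ by a few percent, and the difference shrinks as the number of cells grows.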

I might integrate this into the multitable package, because the multitable framework treats matrices and arrays as single variables (rather than as sets of variables, as scale does).
