13. Data Management

Here we look at some common tasks that come up when dealing with data. These tasks range from assembling different data sets into more convenient forms to applying functions to different parts of a data set. The topics in this section demonstrate some of the power of R, although that may not be clear at first. The functions are used in a wide variety of circumstances and for many different reasons. These tools have saved me a great deal of time and effort in circumstances that I would not have predicted in advance.

The important thing to note, though, is that this section is called “Data Management.” It is not called “Data Manipulation.” Politicians “manipulate” data; we “manage” them.

13.1. Appending Data

When you have more than one set of data you may want to bring them together. You can bring different data sets together by appending them as rows (rbind) or as columns (cbind). The first example shows how this is done with two data frames. Both functions accept any number of objects. We use only two here to keep the demonstration simple, but additional data frames can be appended in the same call. It is important to note that when you append data frames as rows, the column names of the data frames must be the same.

> a <- data.frame(one=c( 0, 1, 2),two=c("a","a","b"))
> b <- data.frame(one=c(10,11,12),two=c("c","c","d"))
> a
  one two
1   0   a
2   1   a
3   2   b
> b
  one two
1  10   c
2  11   c
3  12   d
> v <- rbind(a,b)
> typeof(v)
[1] "list"
> v
  one two
1   0   a
2   1   a
3   2   b
4  10   c
5  11   c
6  12   d
> w <- cbind(a,b)
> typeof(w)
[1] "list"
> w
  one two one two
1   0   a  10   c
2   1   a  11   c
3   2   b  12   d
> names(w) = c("one","two","three","four")
> w
  one two three four
1   0   a    10    c
2   1   a    11    c
3   2   b    12    d
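
The two points made above, that more than two data frames can be appended in one call and that the column names must agree when appending rows, can be sketched as follows. This is only an illustration: the data frames third, reordered, and renamed are made up for this sketch, and the text returned by the error handler is ours, not R's own message.

```r
a <- data.frame(one = c(0, 1, 2), two = c("a", "a", "b"))
b <- data.frame(one = c(10, 11, 12), two = c("c", "c", "d"))
third <- data.frame(one = c(20, 21), two = c("e", "f"))  # hypothetical extra data

# rbind accepts any number of data frames in a single call.
stacked <- rbind(a, b, third)
nrow(stacked)

# Columns are matched by name, so a different column order still works.
reordered <- data.frame(two = c("g"), one = c(30))
combined <- rbind(a, reordered)
combined

# If the column names differ, rbind signals an error. We catch it here
# and return our own short message instead of stopping.
renamed <- data.frame(uno = c(30), dos = c("g"))
outcome <- tryCatch(rbind(a, renamed),
                    error = function(e) "names do not match")
outcome
```

Catching the error with tryCatch is just a convenience for the demonstration. At the command line the call to rbind would simply stop with an error.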

The same commands also work with vectors and matrices and behave in a similar manner.

> A = matrix(c( 1, 2, 3, 4, 5, 6),ncol=3,byrow=TRUE)
> A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
> B = matrix(c(10,20,30,40,50,60),ncol=3,byrow=TRUE)
> B
     [,1] [,2] [,3]
[1,]   10   20   30
[2,]   40   50   60
> V <- rbind(A,B)
> typeof(V)
[1] "double"
> V
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]   10   20   30
[4,]   40   50   60
> W <- cbind(A,B)
> typeof(W)
[1] "double"
> W
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3   10   20   30
[2,]    4    5    6   40   50   60
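
The examples above use matrices only. A short sketch of the same commands with plain vectors is given here. The vectors x and y are invented for the illustration.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# With rbind each vector becomes one row of a matrix.
# The row names are taken from the variable names.
by_row <- rbind(x, y)
by_row
dim(by_row)

# With cbind each vector becomes one column.
by_col <- cbind(x, y)
by_col
dim(by_col)

# A vector can also be appended to an existing matrix,
# provided its length matches the number of columns.
A <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 3, byrow = TRUE)
extended <- rbind(A, y)
extended
```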

13.2. Applying Functions Across Data Elements

The various apply functions can be invaluable tools when working with subsets within a data set. The different versions of the apply commands take a function and have it perform an operation on each part of the data. There is a wide variety of these commands, but we only look at two sets of them. The first set, lapply and sapply, is used to apply a function to every element in a list. The second, tapply, is used to apply a function to each subset of the data defined by a given set of factors.

13.2.1. Operations on Lists and Vectors

First, the lapply command is used to take a list of items and perform some function on each member of the list. That is, the list includes a number of different objects. You want to perform some operation on every object within the list. You can use lapply to tell R to go through each item in the list and perform the desired action on each item.

In the following example a list is created with three elements. The first is a set of random numbers drawn from a normal distribution. The second is a set of random numbers drawn from an exponential distribution. The last is a factor. A summary is then performed on each element in the list.

> x <- list(a=rnorm(200,mean=1,sd=10),
            b=rexp(300,10.0),
            c=as.factor(c("a","b","b","b","c","c")))
> lapply(x,summary)
$a
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-26.65000  -6.91200  -0.39250   0.09478   6.86700  32.00000

$b
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0001497 0.0242300 0.0633300 0.0895400 0.1266000 0.7160000

$c
a b c
1 3 2
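
One way to think about lapply is as a replacement for an explicit loop over the list. The sketch below compares the two approaches on a small list that we made up for the purpose. The names by_loop and by_lapply are ours.

```r
x <- list(a = c(1, 2, 3, 4), b = c(10, 20, 30))

# Explicit loop: visit each element and store the result under the same name.
by_loop <- list()
for (name in names(x)) {
  by_loop[[name]] <- mean(x[[name]])
}

# lapply performs the same traversal in a single call.
by_lapply <- lapply(x, mean)

# Check whether the two approaches give the same list.
identical(by_loop, by_lapply)

# Arguments given after the function are passed on to it for every element.
medians <- lapply(x, quantile, probs = 0.5)
medians
```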

The lapply command returns a list. The entries in the list have the same names as the entries in the list that is passed to it. The value of each entry is the result of applying the function. The sapply function is similar, but it tries to turn the result into a vector or matrix if possible. If the results cannot be simplified, for example because they have different lengths, then it returns a list just like the lapply command.

> x <- list(a=rnorm(8,mean=1,sd=10),b=rexp(10,10.0))
> x
$a
[1]  -0.3881426   6.2910959  13.0265859  -1.5296377   6.9285984 -28.3050569
[7]  11.9119731  -7.6036997

$b
[1] 0.212689007 0.081818395 0.222462531 0.181424705 0.168476454 0.002924134
[7] 0.007010114 0.016301837 0.081291728 0.055426055

> val <- lapply(x,mean)
> typeof(val)
[1] "list"
> val
$a
[1] 0.04146456

$b
[1] 0.1029825

> val$a
[1] 0.04146456
> val$b
[1] 0.1029825
>
>
> other <- sapply(x,mean)
> typeof(other)
[1] "double"
> other
         a          b
0.04146456 0.10298250
> other[1]
         a
0.04146456
> other[2]
         b
0.1029825
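
The example above shows sapply returning a vector. The other two outcomes, a matrix and a list, can be sketched as follows. The list x and the choice of the range and sort functions are ours, picked only to produce results of equal and unequal lengths.

```r
x <- list(a = c(3, 1, 2), b = c(10, 30, 20, 40))

# range returns two numbers for every element, so sapply builds a matrix
# with one column per list entry.
limits <- sapply(x, range)
is.matrix(limits)
limits

# Here the results have different lengths (3 and 4), so they cannot be
# combined and sapply returns a list, as lapply would.
ordered <- sapply(x, sort)
is.list(ordered)
ordered
```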

13.2.2. Operations By Factors

Another widely used variant of the apply functions is the tapply function. The tapply function takes a set of data, usually a vector, a factor (or a list of factors) of the same length as the data, and a function. It then applies the function to each subset of the data that corresponds to one level of the factor.

> val <- data.frame(a=c(1,2,10,20,5,50),
                    b=as.factor(c("a","a","b","b","a","b")))
> val
   a b
1  1 a
2  2 a
3 10 b
4 20 b
5  5 a
6 50 b
> result <- tapply(val$a,val$b,mean)
> typeof(result)
[1] "double"
> result
       a         b
2.666667 26.666667
> result[1]
       a
2.666667
> result[2]
       b
26.66667
> result <- tapply(val$a,val$b,summary)
> typeof(result)
[1] "list"
> result
$a
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.000   1.500   2.000   2.667   3.500   5.000

$b
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
10.00   15.00   20.00   26.67   35.00   50.00

> result$a
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.000   1.500   2.000   2.667   3.500   5.000
> result$b
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
10.00   15.00   20.00   26.67   35.00   50.00
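
The tapply command can also be given more than one factor by passing them as a list. The sketch below adds a second factor, g, to the data frame used above. The column g is made up for this illustration and is not part of the original example.

```r
val <- data.frame(a = c(1, 2, 10, 20, 5, 50),
                  b = as.factor(c("a", "a", "b", "b", "a", "b")),
                  g = as.factor(c("x", "y", "x", "y", "x", "y")))  # hypothetical second factor

# With two factors the result is a matrix: one row for each level of b
# and one column for each level of g.
result <- tapply(val$a, list(val$b, val$g), mean)
result

# Individual cells can be picked out by the names of the levels.
result["a", "x"]
result["b", "y"]
```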