# 13. Data Management

Here we look at some common tasks that come up when dealing with data. These tasks range from assembling different data sets into more convenient forms to applying functions to different parts of a data set. The topics in this section demonstrate some of the power of R, though that may not be clear at first. The functions are used in a wide variety of circumstances and for a number of different reasons. These tools have saved me a great deal of time and effort in circumstances that I would not have predicted in advance.

The important thing to note, though, is that this section is called
“*Data Management.*” It is not called “*Data Manipulation.*”
Politicians “manipulate” data; we “manage” them.

## 13.1. Appending Data

When you have more than one set of data you may want to bring them
together. You can bring different data sets together by appending as
rows (*rbind*) or by appending as columns (*cbind*). The first example
shows how this is done with two data frames. Both functions accept any
number of objects. We only use two here to keep
the demonstration simple, but additional data frames can be appended
in the same call. It is important to note that when you bring data
frames together as rows, the column names of the data frames must
be the same.

```
> a <- data.frame(one=c( 0, 1, 2),two=c("a","a","b"))
> b <- data.frame(one=c(10,11,12),two=c("c","c","d"))
> a
  one two
1   0   a
2   1   a
3   2   b
> b
  one two
1  10   c
2  11   c
3  12   d
> v <- rbind(a,b)
> typeof(v)
[1] "list"
> v
  one two
1   0   a
2   1   a
3   2   b
4  10   c
5  11   c
6  12   d
> w <- cbind(a,b)
> typeof(w)
[1] "list"
> w
  one two one two
1   0   a  10   c
2   1   a  11   c
3   2   b  12   d
> names(w) = c("one","two","three","four")
> w
  one two three four
1   0   a    10    c
2   1   a    11    c
3   2   b    12    d
```
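As a minimal sketch of the rule about names, the following uses two small hypothetical data frames, `first` and `second`, that are not part of the example above. It assumes the standard behavior of *rbind* for data frames: columns are matched by name rather than by position, and an error is raised when the names differ.

```r
# Hypothetical data frames used only for this illustration.
first  <- data.frame(one = c(0, 1), two = c("a", "b"))
second <- data.frame(two = c("c", "d"), one = c(10, 11))  # same names, other order

# Columns are matched by name, so the column order in 'second' does not matter.
both <- rbind(first, second)
print(both)

# Renaming a column makes the names disagree, and rbind stops with an error.
renamed <- second
names(renamed) <- c("two", "uno")
outcome <- tryCatch(rbind(first, renamed), error = function(e) "error")
print(outcome)

# More than two data frames can be appended in a single call.
tripled <- rbind(first, first, first)
print(nrow(tripled))
```

Wrapping the failing call in `tryCatch` is only a convenience so that the sketch runs to the end; at the console you would simply see the error message.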

The same commands also work with vectors and matrices and behave in a similar manner.

```
> A = matrix(c( 1, 2, 3, 4, 5, 6),ncol=3,byrow=TRUE)
> A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
> B = matrix(c(10,20,30,40,50,60),ncol=3,byrow=TRUE)
> B
     [,1] [,2] [,3]
[1,]   10   20   30
[2,]   40   50   60
> V <- rbind(A,B)
> typeof(V)
[1] "double"
> V
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]   10   20   30
[4,]   40   50   60
> W <- cbind(A,B)
> typeof(W)
[1] "double"
> W
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    2    3   10   20   30
[2,]    4    5    6   40   50   60
```
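The example above only shows matrices. A short sketch of the vector case, using two hypothetical vectors `x` and `y`, might look like the following. With plain vectors, *rbind* treats each vector as a row and *cbind* treats each vector as a column, and the result is a matrix.

```r
# Two hypothetical numeric vectors of the same length.
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# rbind stacks the vectors as rows, giving a 2 x 3 matrix.
byRow <- rbind(x, y)
print(byRow)
print(dim(byRow))

# cbind places the vectors side by side as columns, giving a 3 x 2 matrix.
byCol <- cbind(x, y)
print(byCol)
print(dim(byCol))
```

The rows of `byRow` and the columns of `byCol` take their names from the variables that were passed in.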

## 13.2. Applying Functions Across Data Elements

The various *apply* functions can be invaluable tools when trying to
work with subsets within a data set. The different versions of the
*apply* commands take a function and have it
perform an operation on each part of the data. There is a wide
variety of these commands, but we only look at two sets of them. The
first set, *lapply* and *sapply*, is used to apply a function to every
element in a list. The second one, *tapply*, is used to apply a
function to each subset of the data, where the subsets are defined by
the levels of one or more factors.

### 13.2.1. Operations on Lists and Vectors

First, the *lapply* command takes a list of items and
performs some function on each member of the list. That is, when a list
includes a number of different objects and you want to perform the same
operation on **every** object within the list, you can use *lapply* to
tell R to go through the list and perform the desired
action on each item.

In the following example a list is created with three elements. The first is a randomly generated set of numbers with a normal distribution. The second is a randomly generated set of numbers with an exponential distribution. The last is a factor. A summary is then performed on each element in the list. Because the first two elements are generated randomly, the numbers will differ each time the commands are run.

```
> x <- list(a=rnorm(200,mean=1,sd=10),
            b=rexp(300,10.0),
            c=as.factor(c("a","b","b","b","c","c")))
> lapply(x,summary)
$a
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-26.65000  -6.91200  -0.39250   0.09478   6.86700  32.00000

$b
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.0001497 0.0242300 0.0633300 0.0895400 0.1266000 0.7160000

$c
a b c
1 3 2
```
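Any additional arguments given to *lapply* after the function are passed along to that function for every element. The sketch below illustrates this with a hypothetical list, `scores`, that contains a missing value; it is not part of the example above.

```r
# Hypothetical list; the first element contains a missing value.
scores <- list(a = c(1, NA, 3), b = c(2, 4))

# Without extra arguments, the mean of 'a' is NA because of the missing value.
plain <- lapply(scores, mean)
print(plain)

# Extra arguments after the function are handed to mean() for each element.
clean <- lapply(scores, mean, na.rm = TRUE)
print(clean)
```

The same approach works with a function you write yourself, for example `lapply(scores, function(v) sum(!is.na(v)))` to count the values that are not missing.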

The *lapply* command returns a list. The entries in the list have the
same names as the entries in the list that is passed to it. The value
of each entry is the result of applying the function to the
corresponding entry. The *sapply* function is similar, but it **tries**
to simplify the result into a vector or matrix if possible. If the
results cannot be simplified, for example because they have different
lengths, then it returns a list just like the *lapply* command.

```
> x <- list(a=rnorm(8,mean=1,sd=10),b=rexp(10,10.0))
> x
$a
[1]  -0.3881426   6.2910959  13.0265859  -1.5296377   6.9285984 -28.3050569
[7]  11.9119731  -7.6036997

$b
[1] 0.212689007 0.081818395 0.222462531 0.181424705 0.168476454 0.002924134
[7] 0.007010114 0.016301837 0.081291728 0.055426055

> val <- lapply(x,mean)
> typeof(val)
[1] "list"
> val
$a
[1] 0.04146456

$b
[1] 0.1029825

> val$a
[1] 0.04146456
> val$b
[1] 0.1029825

> other <- sapply(x,mean)
> typeof(other)
[1] "double"
> other
         a          b
0.04146456 0.10298250
> other[1]
         a
0.04146456
> other[2]
        b
0.1029825
```
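To make the simplification rule more concrete, here is a small sketch with a hypothetical list, `parts`, whose values are fixed so that the results are reproducible. It shows the three outcomes you are likely to meet: a vector, a matrix, and a list.

```r
# Hypothetical list with fixed values.
parts <- list(a = c(3, 1, 2), b = c(10, 40, 20, 30))

# Every result has length one, so sapply returns a named vector.
means <- sapply(parts, mean)
print(means)

# Every result has length two, so sapply returns a matrix
# with one column per element of the list.
ranges <- sapply(parts, range)
print(ranges)

# The results have different lengths, so sapply falls back to a list.
large <- sapply(parts, function(v) v[v > 2])
print(large)
```

If you need to rely on the type of the result in later code, *lapply* is the safer choice because it always returns a list.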

### 13.2.2. Operations By Factors

Another widely used variant of the *apply* functions is the *tapply*
function. The *tapply* function takes three things: a set of data,
usually a vector; a factor (or a list of factors) of the same length
as the data; and a function. It then applies the function to each
subset of the data that corresponds to one level of the factor.

```
> val <- data.frame(a=c(1,2,10,20,5,50),
                    b=as.factor(c("a","a","b","b","a","b")))
> val
   a b
1  1 a
2  2 a
3 10 b
4 20 b
5  5 a
6 50 b
> result <- tapply(val$a,val$b,mean)
> typeof(result)
[1] "double"
> result
        a         b
 2.666667 26.666667
> result[1]
       a
2.666667
> result[2]
       b
26.66667
> result <- tapply(val$a,val$b,summary)
> typeof(result)
[1] "list"
> result
$a
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.500   2.000   2.667   3.500   5.000

$b
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.00   15.00   20.00   26.67   35.00   50.00

> result$a
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   1.500   2.000   2.667   3.500   5.000
> result$b
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  10.00   15.00   20.00   26.67   35.00   50.00
```
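The *tapply* function also accepts a list of factors, in which case the function is applied to every combination of their levels. The following sketch assumes a hypothetical data frame, `sales`, with two grouping columns; the names are made up for the illustration.

```r
# Hypothetical data frame with one numeric column and two factors.
sales <- data.frame(amount = c(1, 2, 10, 20, 5, 50),
                    group  = factor(c("a", "a", "b", "b", "a", "b")),
                    kind   = factor(c("x", "y", "x", "y", "x", "y")))

# Passing a list of two factors gives one cell per combination of levels.
# The result is a matrix: rows follow 'group', columns follow 'kind'.
totals <- tapply(sales$amount, list(sales$group, sales$kind), sum)
print(totals)

# Individual cells can be picked out by the level names.
print(totals["b", "y"])
```

With a single factor the result is a vector, as in the example above; with two factors it becomes a two-dimensional table of results.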