The apply family is a very important set of functions in a vectorized language like R. Its members are `apply`, `lapply`, `sapply`, `mapply`, and `tapply`. I use `lapply` and `sapply` most often, although for a matrix, `apply` is more suitable. These functions work well for complicated iterative calculations and are usually more compact, and often faster, than explicit loops (particularly loops that grow their result step by step), which look like brute-force methods in comparison.
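As a minimal sketch of the loop-versus-apply contrast, the snippet below squares each element of a small vector both ways. The vector `x` and the result names are hypothetical and chosen only for illustration.

```r
# Hypothetical toy input (not part of the dataset used in this post)
x <- 1:5

# Explicit loop with a preallocated result vector
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Same computation expressed with sapply
squares_sapply <- sapply(x, function(v) v^2)

# Both approaches give the same numeric vector
identical(squares_loop, squares_sapply)
```

The `sapply` version needs no index bookkeeping, which is where most of the gain in readability comes from.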

Set up a toy dataset

```
n <- 100
age <- rnorm(n, 50, 10)
b <- sample(1:3 * n, n, replace = TRUE)  # values 100, 200 or 300
c <- age * rbeta(n, 1, 5)
d <- (age / 8)^3 * c + log(b) * rbeta(n, 2, 4)
e <- as.factor(sample(c("never", "former", "current"), n, replace = TRUE, prob = c(.5, .3, .2)))  # as in smoking status
f <- as.logical(sample(c("TRUE", "FALSE"), n, replace = TRUE, prob = c(0.2, 0.8)))
df <- data.frame(a = age, b = b, c = c, d = d, e = e, f = f)
head(df)  # summary(df)
```

```
##          a   b        c        d      e     f
## 1 52.93529 200 1.665629 485.3124  never FALSE
## 2 47.85212 200 1.532457 329.6634 former FALSE
```

# \(apply\)

Applies a function over one dimension (margin) of a matrix: `MARGIN = 1` works row by row, `MARGIN = 2` column by column.
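A small sketch of the two margins, using a hypothetical 2 x 3 matrix `m` that is not part of the toy dataset:

```r
# Hypothetical matrix: rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

row_totals <- apply(m, 1, sum)  # MARGIN = 1: one result per row
col_totals <- apply(m, 2, sum)  # MARGIN = 2: one result per column

row_totals
col_totals
```

The length of each result matches the size of the margin that was kept: two row totals and three column totals.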

It is easy to check the data type of columns in a data frame:

`sapply(sapply (df, is), function(x) x[1])`

```
##         a         b         c         d         e         f
## "numeric" "numeric" "numeric" "numeric"  "factor" "logical"
```

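The following do NOT work, even though they are sometimes recommended. The reason is that `apply` first converts the data frame to a matrix with `as.matrix`, and because `df` contains a factor column, every column becomes character: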

`apply(df, 2, class) `

```
##           a           b           c           d           e           f
## "character" "character" "character" "character" "character" "character"
```
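The coercion behind this result can be seen on a tiny example. This is a sketch with a hypothetical two-column data frame `toy`; the names are assumptions made for illustration only.

```r
# Hypothetical data frame with one numeric and one factor column
toy <- data.frame(x = c(1.5, 2.5), g = factor(c("a", "b")))

# apply() works on as.matrix(toy); a matrix holds a single type,
# so the mixed columns are all converted to character
mat <- as.matrix(toy)
typeof(mat)

# Hence every column reports "character" through apply() ...
apply(toy, 2, class)

# ... while sapply() visits the original columns one by one
sapply(toy, class)
```

Because a data frame is a list of columns, `sapply` and `lapply` see each column with its own type, which is why they are the safer choice for type checks.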

`apply (df, 2, function (x) sapply (c("character", "complex", "integer", "logical", "numeric"), function(y) class(x) %in% y))`

```
##               a     b     c     d     e     f
## character  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## complex   FALSE FALSE FALSE FALSE FALSE FALSE
## integer   FALSE FALSE FALSE FALSE FALSE FALSE
## logical   FALSE FALSE FALSE FALSE FALSE FALSE
## numeric   FALSE FALSE FALSE FALSE FALSE FALSE
```

`apply (df, 2, function(x) get(typeof(x)))`

```
## $a
## function (length = 0L)
## .Internal(vector("character", length))
## <bytecode: 0x7fa3d88b6e50>
## <environment: namespace:base>
##
## (output trimmed for brevity)
```

There are many other uses for \(apply\), especially once the data frame has been restricted to columns of one type with `df[, sapply(df, is.X)]`, where `is.X` stands for a predicate such as `is.numeric`.

`apply(df[,sapply(df,is.numeric)], 2, sd) # by column; does NOT make sense to go by row.`

```
##           a           b           c           d
##   10.509733   81.550629    7.387919 4154.847557
```

```
norm <- apply(df[,sapply(df,is.numeric)], 2, shapiro.test); #str(norm)
sapply(norm, function(x) (names(x)))
```

```
##      a           b           c           d
## [1,] "statistic" "statistic" "statistic" "statistic"
## [2,] "p.value"   "p.value"   "p.value"   "p.value"
## [3,] "method"    "method"    "method"    "method"
## [4,] "data.name" "data.name" "data.name" "data.name"
```

`sapply(norm, function(x) format(x$p.value, nsmall=2))`

```
##              a              b              c              d
##    "0.2955282" "1.631008e-10" "3.465593e-10" "2.371274e-16"
```

`sapply(norm, function(x) format(x$statistic, nsmall=5))`

```
##         a.W         b.W         c.W         d.W
## "0.9845689" "0.7940378" "0.8047188" "0.5296003"
```

`apply(df[,sapply(df,is.numeric)], 2, summary) `

```
##                a   b         c           d
## Min.    25.63873 100  0.059253    10.05707
## 1st Qu. 42.79097 100  3.089976   656.19388
## Median  49.23844 200  7.013626  1480.94939
## Mean    50.54818 196  8.282124  2662.39926
## 3rd Qu. 58.66898 300 10.869602  3178.33952
## Max.    82.39060 300 50.694784 27690.42794
```

Retrieve the median and the interquartile range (1st and 3rd quartiles):

`apply(apply(df[ ,sapply(df,is.numeric)], 2, summary), 2, function (x) c(x[3],x[2],x[5]))`

```
##                a   b         c         d
## Median  49.23844 200  7.013626 1480.9494
## 1st Qu. 42.79097 100  3.089976  656.1939
## 3rd Qu. 58.66898 300 10.869602 3178.3395
```
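The same three numbers can also be requested directly from `quantile`, skipping the intermediate `summary` table. This is only a sketch on a hypothetical data frame `toy` with simulated columns; the extra `probs` argument is forwarded to `quantile` by `sapply`.

```r
# Hypothetical numeric data, simulated for illustration
set.seed(1)
toy <- data.frame(u = rnorm(50), v = runif(50))

# probs is passed through to quantile(): median, 1st quartile, 3rd quartile
med_iqr <- sapply(toy, quantile, probs = c(0.5, 0.25, 0.75))
med_iqr
```

The result is a matrix with one column per variable and rows labelled `50%`, `25%` and `75%`.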

You can pass additional arguments to the function by listing them after its name (here `plot.it` for `qqnorm`):

```
par(mfrow = c(2, 2))
t <- apply(df[ ,sapply(df,is.numeric)], 2, qqnorm, plot.it=T)
```

`par(mfrow = c(1, 1))`

\(apply\) can also be used to add random NAs to a data frame:

```
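fractionBlank <- 10
# Note: apply() goes through as.matrix(), so the columns lose their original types,
# and sampling with replacement can blank fewer than n/fractionBlank cells per column.
df <- apply(df, 2, function(x) {x[sample(1:n, floor(n/fractionBlank), replace = TRUE)] <- NA; x})
df <- data.frame(df)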
```
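If the column types matter afterwards, one possible alternative is to loop over the columns with `lapply` and assign back into the data frame. The sketch below uses a hypothetical data frame `toy` and samples without replacement so that each column gets exactly the requested number of NAs; it is a variation on the approach above, not a reproduction of it.

```r
# Hypothetical data frame with three column types
set.seed(2)
n <- 20
toy <- data.frame(
  num  = rnorm(n),
  grp  = factor(sample(c("x", "y"), n, replace = TRUE)),
  flag = sample(c(TRUE, FALSE), n, replace = TRUE)
)

n_blank <- floor(n / 10)  # number of cells to blank per column

# toy[] <- keeps the data frame structure; lapply() sees each column with its own type
toy[] <- lapply(toy, function(col) {
  col[sample(n, n_blank)] <- NA  # without replacement: exactly n_blank NAs
  col
})

sapply(toy, class)    # types are preserved
colSums(is.na(toy))   # n_blank missing values in every column
```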

And count the number of NAs per column with a **nested** structure

`apply(apply (df, 2, is.na), 2, sum)`

```
##  a  b  c  d  e  f
## 10 10 10  9 10  9
```
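For comparison, here is a sketch of the nested call next to two shorter equivalents, on a hypothetical data frame `toy` with known missing values:

```r
# Hypothetical data frame: 1 NA in p, 2 NAs in q
toy <- data.frame(p = c(1, NA, 3), q = c(NA, NA, "z"))

nested  <- apply(apply(toy, 2, is.na), 2, sum)           # nested apply, as above
direct  <- colSums(is.na(toy))                           # is.na() on the whole data frame
per_col <- sapply(toy, function(col) sum(is.na(col)))    # one column at a time

rbind(nested, direct, per_col)
```

All three rows of the printed table agree; `colSums(is.na(...))` is the shortest to type.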

# \(lapply\) and \(sapply\)

For a matrix you have to use \(apply\), but a data frame is a list of columns, so you can use \(lapply\) (returns a list) or \(sapply\) (a simplified \(lapply\) that returns a vector or matrix when possible). A classic example is **loading libraries**:

`sapply(c("ggplot2", "reshape2"), require, character.only = T)`

```
## Loading required package: ggplot2
## Loading required package: reshape2
##  ggplot2 reshape2
##     TRUE     TRUE
```
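The difference in return type between \(lapply\) and \(sapply\) can be sketched on a hypothetical two-column data frame `toy`. `vapply` is included as a stricter variant that declares the expected result type up front.

```r
# Hypothetical data frame
toy <- data.frame(u = c(1, 2, 3), v = c(10, 20, 30))

as_list  <- lapply(toy, mean)              # list with elements u and v
as_vec   <- sapply(toy, mean)              # named numeric vector
safe_vec <- vapply(toy, mean, numeric(1))  # same values, errors if a result is not one number

str(as_list)
as_vec
```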

As already shown above, it is possible to **nest** multiple levels of \(sapply\) to evaluate a function of two (or more) arguments over a grid of values. For example, you can create a matrix with 10,000 elements very quickly:

```
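set.seed(53)
n <- 10
sample <- rnorm(n, 32, 12)           # simulated observations
mu <- seq(20, 60, length.out = 100)  # candidate means
sd <- seq(9, 16, length.out = 100)   # candidate standard deviations
loglik <- function(mu, sd) sum(log(dnorm(sample, mu, sd)))  # normal log-likelihood of the sample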
```

Below is the **nested** application of `sapply`

```
m <- sapply(mu, function(x) sapply(sd, function(y) loglik(x,y)))
dimnames(m) <- list(sd, mu) #str(m)
```

The above is a real example of maximum likelihood estimation (MLE) by grid search; the brute-force version with nested loops is longer to write and can be considerably slower.

`mx <- which (m == max(m), arr.ind = T); max(m); mx`

```
## [1] -39.2222
##                  row col
## 12.2525252525253  47  31
```

`mu[mx[2]] #mle for mu`

`## [1] 32.12121`

`sd[mx[1]] #mle for sd`

`## [1] 12.25253`
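For a normal sample the maximum likelihood estimates have a closed form, $\hat\mu = \bar{x}$ and $\hat\sigma = \sqrt{\tfrac{1}{n}\sum_i (x_i - \bar{x})^2}$, which gives a quick sanity check on the grid search. The sketch below repeats the grid search under different, hypothetical names (`x`, `mu_grid`, `sd_grid`, `ll`) so that no base function is shadowed; it assumes the same seed and grids as above.

```r
set.seed(53)
x <- rnorm(10, 32, 12)                    # same draw as the sample above
mu_grid <- seq(20, 60, length.out = 100)  # candidate means
sd_grid <- seq(9, 16, length.out = 100)   # candidate standard deviations

# Normal log-likelihood; log = TRUE is equivalent to log(dnorm(...))
loglik <- function(mu, sd) sum(dnorm(x, mu, sd, log = TRUE))

# Rows index sd_grid, columns index mu_grid
ll <- sapply(mu_grid, function(m) sapply(sd_grid, function(s) loglik(m, s)))
best <- which(ll == max(ll), arr.ind = TRUE)

grid_est    <- c(mu = mu_grid[best[1, "col"]], sd = sd_grid[best[1, "row"]])
closed_form <- c(mu = mean(x), sd = sqrt(mean((x - mean(x))^2)))

rbind(grid_est, closed_form)
```

The grid estimate can only be as precise as the grid spacing, so the two rows should agree to within roughly one grid step.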

`lapply` can be used to convert a table of explanatory variables into a data frame (more to come on this subject). (Crawley, p. 250)

With `sapply`, any argument beyond the second is passed on to the function. This example is useful if you have many variables and want to find their correlation with one of them:

`#cors <- sapply(df, cor, y=b); cors # call the function cor for every column. `
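A runnable sketch of this idea, on a hypothetical data frame `toy` rather than the `df` above: `cor` needs numeric input, so the numeric columns are selected first, and the extra argument `y` is forwarded to `cor` for every column.

```r
# Hypothetical data: w is strongly related to u, v is independent noise
set.seed(3)
toy <- data.frame(
  u = rnorm(30),
  v = rnorm(30),
  g = factor(rep(c("a", "b", "c"), 10))
)
toy$w <- 2 * toy$u + rnorm(30, sd = 0.1)

# Keep numeric columns only, then correlate each of them with u
num  <- toy[, sapply(toy, is.numeric)]
cors <- sapply(num, cor, y = toy$u)
cors
```

The entry for `u` is its correlation with itself, which serves as a quick check that the call is wired up correctly.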

# \(mapply\)

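\(mapply\) is the multivariate version of \(sapply\): it calls a function with the first elements of each argument, then the second elements, and so on. It is meant for non-vectorized functions that accept multiple arguments. Although many R functions are vectorized, some are not, including many user-written functions. Arguments that should stay the same in every call go in `MoreArgs`, e.g. `MoreArgs = list(arg1=sd)`.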
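A minimal sketch, assuming a hypothetical helper `make_label` that is not vectorized because of its `if` statement:

```r
# Hypothetical non-vectorized function: if() only accepts a single condition
make_label <- function(name, count, sep) {
  if (count > 1) paste(name, count, sep = sep) else name
}

# name and count are walked through in parallel; sep is held constant via MoreArgs
labels <- mapply(make_label,
                 name  = c("a", "b", "c"),
                 count = c(1, 2, 3),
                 MoreArgs = list(sep = "_"))
labels
```

Because the first varying argument is a character vector, `mapply` also uses it to name the result.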

# \(tapply\)

\(tapply\) applies a function to subsets of a vector defined by a grouping factor. Its syntax is \(tapply(X, INDEX, FUN, …)\), where INDEX is a grouping variable that is coerced to a factor (if not already one). You can compute **counts** (using the function *length*), means, etc. across datasets by factors with **exceptionally efficient** code (see Section 6.4 in Introduction to Scientific Programming and Simulation Using R). Very neat!
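A short sketch of counts and means by group, on a hypothetical data frame `toy` with a smoking-status column:

```r
# Hypothetical data: age grouped by smoking status
toy <- data.frame(
  status = c("never", "former", "never", "current", "former", "never"),
  age    = c(40, 55, 60, 35, 65, 50)
)

counts <- tapply(toy$age, toy$status, length)  # group sizes
means  <- tapply(toy$age, toy$status, mean)    # group means

counts
means
```

The character column is coerced to a factor, so the groups come out in alphabetical order of their labels.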

# \(by\)

This function is closely related to \(tapply\) in that \(by\) allows one to apply a function to a group of rows by selecting a grouping factor. The syntax is \(by(dataframe, groupingfactor, function)\). See Section 6.6 in The R Cookbook.

The advantage of \(by\) over \(tapply\) is that it returns a special list that has a print method, but accessing individual elements is less convenient (section 8.2 of Data Manipulation with R).
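The sketch below illustrates both points on a hypothetical data frame `toy`: the result prints group by group, and single groups can be pulled out by name.

```r
# Hypothetical data: two measurements grouped by smoking status
toy <- data.frame(
  status = c("never", "former", "never", "former"),
  age    = c(40, 55, 60, 65),
  bmi    = c(22, 27, 24, 29)
)

# colMeans receives the rows of each group as a small data frame
res <- by(toy[, c("age", "bmi")], toy$status, colMeans)

res              # printed one group at a time by the print method for class "by"
res[["never"]]   # extract the result for a single group
```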