Simulated (toy) datasets are useful for testing data analysis tools and other functions or transformations. For example, inserting random blanks (NAs) makes it possible to test imputation procedures. I have written a function that quickly inserts NAs into a vector; combined with one of the apply family functions, it can be used across rows, columns, or a whole data frame.
library(kbtools)
kbtools::libmgr("l", c("kbtools", "rms", "ggplot2", "pwr", "epiR"))
Create a toy dataset
n <- 100
age <- rnorm(n, 50, 10)
b <- sample(1:3 * n, n, replace = TRUE) # 1:3 * n is (1:3) * n, i.e. 100, 200, 300
c <- age * rbeta(n, 1, 5)
d <- sqrt((age / 8)^3 * c + log(b) * rbeta(n, 2, 4))
e <- as.factor(sample(c("never", "former", "current"), n, replace = TRUE, prob = c(.5, .3, .2))) # as in smoking status
f <- sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.2, 0.8))
df <- data.frame(a = age, lab=b, c=c, d=d, smoking=e, logi=f)
head(df); summary(df)
## a lab c d smoking logi
## 1 34.07224 100 11.870923 30.30698 never FALSE
## 2 61.03397 300 16.606928 85.88570 never FALSE
## 3 55.24809 100 34.943976 107.28290 former FALSE
## 4 64.42273 300 6.985874 60.42484 never FALSE
## 5 60.61774 300 32.672337 119.22872 never FALSE
## 6 56.77071 200 9.100284 57.04091 current TRUE
## a lab c d smoking
## Min. :17.46 Min. :100 Min. : 0.297 Min. : 5.365 current:22
## 1st Qu.:42.22 1st Qu.:100 1st Qu.: 2.794 1st Qu.: 21.022 former :26
## Median :48.48 Median :200 Median : 7.130 Median : 38.910 never :52
## Mean :48.66 Mean :202 Mean : 9.200 Mean : 43.172
## 3rd Qu.:54.58 3rd Qu.:300 3rd Qu.:13.349 3rd Qu.: 57.761
## Max. :81.98 Max. :300 Max. :34.944 Max. :140.758
## logi
## Mode :logical
## FALSE:78
## TRUE :22
Function to insert NAs (blanks) into a vector
The user can set the percentage of values (cells) that will be blanked out at random. If this parameter is not specified, or is not a number between 0 and 99, it defaults to 10% (the default can be changed to any other value inside the function):
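blankOut <- function(vector, fractionBlank) {
  # fractionBlank is a percentage (0-99); fall back to 10% if missing or invalid
  if (missing(fractionBlank)) fractionBlank <- 10
  if (!is.numeric(fractionBlank)) fractionBlank <- 10
  if ((fractionBlank < 0) | (fractionBlank > 99)) fractionBlank <- 10
  # Draw positions at random; with replace = TRUE the same position can be
  # drawn more than once, so the number of distinct NAs may be lower
  vector[sample(1:length(vector),
                floor(length(vector) * fractionBlank / 100),
                replace = TRUE)] <- NA
  return(vector) # return the modified vector (the original returned an undefined x)
}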
Apply this function to the data frame
df. <- data.frame(lapply (df, blankOut, fractionBlank=25))
summary(df.)
## a lab c d
## Min. :17.46 Min. :100.0 Min. : 0.297 Min. : 6.514
## 1st Qu.:42.61 1st Qu.:100.0 1st Qu.: 2.659 1st Qu.: 19.867
## Median :48.94 Median :200.0 Median : 6.977 Median : 39.117
## Mean :49.28 Mean :201.2 Mean : 8.590 Mean : 42.702
## 3rd Qu.:55.89 3rd Qu.:300.0 3rd Qu.:11.965 3rd Qu.: 56.678
## Max. :81.98 Max. :300.0 Max. :34.944 Max. :119.229
## NA's :22 NA's :17 NA's :24 NA's :22
## smoking logi
## current:17 Mode :logical
## former :21 FALSE:64
## never :41 TRUE :16
## NA's :21 NA's :20
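The number of NAs varies between columns (from 17 to 24 in this example) because of the replace=TRUE argument in the call to sample: the same position can be drawn more than once, so fewer than 25 distinct cells may end up blank. Setting it to FALSE results in exactly the same number of blanks in every column.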
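To illustrate the replace=FALSE behaviour, here is a minimal sketch of a variant that samples positions without replacement. The name blankOutExact and the small toy data frame are assumptions made for this illustration only; they are not part of the original function.

```r
# Hypothetical variant of blankOut(): positions are sampled without replacement,
# so exactly floor(length(x) * fractionBlank / 100) values become NA.
blankOutExact <- function(x, fractionBlank = 10) {
  # Fall back to 10% on invalid input, mirroring the original function
  if (!is.numeric(fractionBlank) || fractionBlank < 0 || fractionBlank > 99) {
    fractionBlank <- 10
  }
  nBlank <- floor(length(x) * fractionBlank / 100)
  x[sample(seq_along(x), nBlank, replace = FALSE)] <- NA
  x
}

set.seed(1) # arbitrary seed, only so the example is reproducible
toy <- data.frame(
  num = rnorm(100),
  grp = factor(sample(letters[1:3], 100, replace = TRUE))
)
toyBlank <- data.frame(lapply(toy, blankOutExact, fractionBlank = 25))

# Each column now has the same number of NAs: floor(100 * 25 / 100) = 25
colSums(is.na(toyBlank))
```

The trade-off is a matter of taste: sampling with replacement gives a slightly random amount of missingness per column, while sampling without replacement makes the blanked share exact.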
Next, we verify that the function has not changed the variable types:
sapply(df, class)
## a lab c d smoking logi
## "numeric" "numeric" "numeric" "numeric" "factor" "logical"
sapply(df., class)
## a lab c d smoking logi
## "numeric" "numeric" "numeric" "numeric" "factor" "logical"
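The same check can be done programmatically rather than by eye. The sketch below is self-contained and uses a small stand-in helper (blankSome) and its own toy data (orig, blanked); these names are hypothetical and stand in for blankOut, df and df. above.

```r
# Minimal stand-in for the blanking step (hypothetical helper, not the post's function)
blankSome <- function(x, pct = 25) {
  x[sample(seq_along(x), floor(length(x) * pct / 100))] <- NA
  x
}

set.seed(42) # arbitrary seed, only so the example is reproducible
orig <- data.frame(
  num  = rnorm(40),
  fac  = factor(sample(c("never", "former", "current"), 40, replace = TRUE)),
  flag = sample(c(TRUE, FALSE), 40, replace = TRUE)
)
blanked <- data.frame(lapply(orig, blankSome))

# Column classes should be identical before and after blanking
sameClasses <- identical(sapply(orig, class), sapply(blanked, class))

# Every value that was not blanked should be unchanged
unchanged <- all(mapply(
  function(o, b) all(o[!is.na(b)] == b[!is.na(b)]),
  orig, blanked
))

# Number of NAs introduced per column
naCounts <- colSums(is.na(blanked))
```

Here sameClasses and unchanged should both be TRUE, and naCounts reports how many cells were blanked in each column.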