Toy datasets with random NAs

Simulated (toy) datasets are very helpful for testing data analysis tools and other functions or transformations. For example, inserting random blanks (NAs) makes it possible to test imputation procedures. I have written a function that quickly inserts NAs into a vector; with one of the apply family functions it can be used across rows, across columns, or on a whole data frame.

library(kbtools)
kbtools::libmgr("l", c("kbtools", "rms", "ggplot2", "pwr", "epiR"))

Create a toy dataset

n <- 100
age <- rnorm(n, 50, 10)
b <- sample((1:3) * n, n, replace=TRUE)  # one of 100, 200, 300
c <- age * rbeta(n, 1, 5)
d <- sqrt((age/8)^3 * c + log(b) * rbeta(n, 2, 4))
e <- as.factor(sample(c("never", "former", "current"), n, replace=TRUE, prob=c(.5, .3, .2))) # as in smoking status
f <- as.logical(sample(c("TRUE", "FALSE"), n, replace=TRUE, prob=c(0.2, 0.8)))
df <- data.frame(a=age, lab=b, c=c, d=d, smoking=e, logi=f)
head(df); summary(df)
##          a lab         c         d smoking  logi
## 1 34.07224 100 11.870923  30.30698   never FALSE
## 2 61.03397 300 16.606928  85.88570   never FALSE
## 3 55.24809 100 34.943976 107.28290  former FALSE
## 4 64.42273 300  6.985874  60.42484   never FALSE
## 5 60.61774 300 32.672337 119.22872   never FALSE
## 6 56.77071 200  9.100284  57.04091 current  TRUE
##        a              lab            c                d              smoking  
##  Min.   :17.46   Min.   :100   Min.   : 0.297   Min.   :  5.365   current:22  
##  1st Qu.:42.22   1st Qu.:100   1st Qu.: 2.794   1st Qu.: 21.022   former :26  
##  Median :48.48   Median :200   Median : 7.130   Median : 38.910   never  :52  
##  Mean   :48.66   Mean   :202   Mean   : 9.200   Mean   : 43.172               
##  3rd Qu.:54.58   3rd Qu.:300   3rd Qu.:13.349   3rd Qu.: 57.761               
##  Max.   :81.98   Max.   :300   Max.   :34.944   Max.   :140.758               
##     logi        
##  Mode :logical  
##  FALSE:78       
##  TRUE :22       

Function to insert NAs (blanks) into a vector

The user can set the percentage of values/cells that will be blanked out at random. If this parameter is missing, non-numeric, or outside the range 0 to 99, it defaults to 10% (the default can be changed to any other value in the code):

blankOut <- function(vector, fractionBlank){
  # Fall back to 10% if the percentage is missing, non-numeric, or out of range
  if (missing(fractionBlank)) fractionBlank <- 10
  if (!is.numeric(fractionBlank)) fractionBlank <- 10
  if ((fractionBlank < 0) || (fractionBlank > 99)) fractionBlank <- 10

  # Indices are drawn with replacement, so the same position can be picked twice
  vector[sample(seq_along(vector),
                floor(length(vector) * fractionBlank/100),
                replace=TRUE)] <- NA
  return(vector)
}
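
The default and fallback behavior described above can be checked on a single vector. The sketch below is only an illustration: `blankOutSketch` is a hypothetical compact variant that puts the 10% default in the function signature instead of using `missing()`, and the seed is arbitrary.

```r
# Hypothetical compact variant of blankOut(); not the function from the post
blankOutSketch <- function(vector, fractionBlank = 10) {
  # Non-numeric or out-of-range percentages fall back to 10%
  if (!is.numeric(fractionBlank) || fractionBlank < 0 || fractionBlank > 99) {
    fractionBlank <- 10
  }
  idx <- sample(seq_along(vector),
                floor(length(vector) * fractionBlank / 100),
                replace = TRUE)
  vector[idx] <- NA
  vector
}

set.seed(42)  # arbitrary seed, only for repeatability
x <- 1:200
sum(is.na(blankOutSketch(x)))       # default 10%: at most 20 NAs
sum(is.na(blankOutSketch(x, 50)))   # 50%: at most 100 NAs
sum(is.na(blankOutSketch(x, "a")))  # invalid input, treated as 10%
```

The counts are upper bounds rather than exact values because the indices are sampled with replacement.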

Apply this function to the data frame

df. <- data.frame(lapply (df, blankOut, fractionBlank=25))
summary(df.)
##        a              lab              c                d          
##  Min.   :17.46   Min.   :100.0   Min.   : 0.297   Min.   :  6.514  
##  1st Qu.:42.61   1st Qu.:100.0   1st Qu.: 2.659   1st Qu.: 19.867  
##  Median :48.94   Median :200.0   Median : 6.977   Median : 39.117  
##  Mean   :49.28   Mean   :201.2   Mean   : 8.590   Mean   : 42.702  
##  3rd Qu.:55.89   3rd Qu.:300.0   3rd Qu.:11.965   3rd Qu.: 56.678  
##  Max.   :81.98   Max.   :300.0   Max.   :34.944   Max.   :119.229  
##  NA's   :22      NA's   :17      NA's   :24       NA's   :22       
##     smoking      logi        
##  current:17   Mode :logical  
##  former :21   FALSE:64       
##  never  :41   TRUE :16       
##  NA's   :21   NA's :20       

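The number of NAs varies from column to column (between 17 and 24 in the summary of df. above) because sample() is called with replace=TRUE: the same position can be drawn more than once, so a column can end up with fewer than the 25 requested blanks. Setting replace=FALSE gives exactly the same number of blanks in every column.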
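
A minimal sketch of that difference, assuming a hypothetical variant `blankOutExact` that exposes `replace` as an argument (the toy data frame and the seed are also made up for illustration):

```r
# Hypothetical variant: same idea as blankOut(), with `replace` as a parameter
blankOutExact <- function(vector, fractionBlank = 10, replace = FALSE) {
  k <- floor(length(vector) * fractionBlank / 100)  # number of draws
  vector[sample(seq_along(vector), k, replace = replace)] <- NA
  vector
}

set.seed(123)  # arbitrary seed
toy <- data.frame(x = rnorm(100),
                  y = runif(100),
                  z = sample(letters, 100, replace = TRUE))

exact <- data.frame(lapply(toy, blankOutExact, fractionBlank = 25))
loose <- data.frame(lapply(toy, blankOutExact, fractionBlank = 25, replace = TRUE))

colSums(is.na(exact))  # exactly 25 NAs in every column
colSums(is.na(loose))  # at most 25 NAs per column, usually fewer
```

Whether an exact or a variable number of blanks is preferable depends on what is being tested; the variable version is arguably closer to real-world missingness.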

Next, verify that the function has not changed the variable types:

sapply(df, class)
##         a       lab         c         d   smoking      logi 
## "numeric" "numeric" "numeric" "numeric"  "factor" "logical"
sapply(df., class)
##         a       lab         c         d   smoking      logi 
## "numeric" "numeric" "numeric" "numeric"  "factor" "logical"
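
The types survive because lapply() handles each column separately. As a rough sketch of the other apply-family options: apply() works well across the rows or columns of a numeric matrix, but on a mixed-type data frame it first converts the data to a matrix, which here becomes character. `blankTwo` is a hypothetical stand-in for the blanking function, and the data and seed are made up.

```r
# Hypothetical stand-in for blankOut(): blanks two random positions
blankTwo <- function(v) {
  v[sample(seq_along(v), 2)] <- NA
  v
}

set.seed(7)  # arbitrary seed

# Numeric matrix: MARGIN = 1 works across rows, MARGIN = 2 across columns
m <- matrix(rnorm(20), nrow = 4)
byRow <- t(apply(m, 1, blankTwo))  # apply() returns rows as columns; transpose back
byCol <- apply(m, 2, blankTwo)
rowSums(is.na(byRow))  # 2 blanks in each row
colSums(is.na(byCol))  # 2 blanks in each column

# Mixed-type data frame: apply() coerces everything to character
toy <- data.frame(num  = rnorm(6),
                  fac  = factor(c("a", "b", "a", "c", "b", "a")),
                  logi = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE))
typeof(apply(toy, 2, blankTwo))                   # "character"
sapply(data.frame(lapply(toy, blankTwo)), class)  # numeric, factor, logical
```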
