Toy datasets with random NAs

Simulated (toy) datasets are useful for testing data analysis tools and other functions or transformations. For example, inserting random blanks (NAs) allows you to test imputation procedures. I have created a function that quickly inserts NAs into a vector and that can be applied across rows, columns, or on the whole data frame with one…
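As a rough illustration of the idea, here is a minimal sketch of such a helper. The name `insert_nas` and the `prop` argument are hypothetical; this is not the author's function, only one way it could look, assuming a fixed fraction of elements is blanked out in each vector.

```r
# Hypothetical helper (not the author's function): replace a random
# fraction `prop` of the elements of vector `x` with NA.
insert_nas <- function(x, prop = 0.2) {
  n_na <- round(prop * length(x))        # how many elements to blank out
  idx  <- sample.int(length(x), n_na)    # random positions, no repeats
  x[idx] <- NA
  x
}

set.seed(42)
toy <- data.frame(a = rnorm(10), b = runif(10), c = letters[1:10])

# Column-wise: every column gets its own random set of NAs
toy_na <- as.data.frame(lapply(toy, insert_nas, prop = 0.3))

# Number of NAs per column
na_per_column <- colSums(is.na(toy_na))
```

Because the helper works on a single vector, the same call can be passed to `lapply` for columns, or applied to one column at a time.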

Strings and regex (regular expressions)

See Chapter 11 in The Art of R Programming, Chapter 7 in The R Cookbook, and Section 2.12 in The R Book (in particular, Sections 2.12.5–2.12.13). There is a good discussion of regular expressions in Sections 7.4–7.8 of Data Manipulation with R. Also see this example on how to melt the dataset. Set-up Split it! strsplit splits…
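For orientation, a small sketch of `strsplit` with a fixed separator and with a regular expression. The example strings are made up for illustration and are not taken from the post.

```r
x <- c("alpha-beta", "gamma-delta-epsilon")

# Split on a literal "-"; fixed = TRUE turns off regex interpretation
parts <- strsplit(x, "-", fixed = TRUE)

# strsplit returns a list with one character vector per input string,
# so pull out the first piece of each with sapply
first <- sapply(parts, `[`, 1)

# Split on a regular expression: one or more commas or whitespace characters
words <- strsplit("a, b,c   d", "[,[:space:]]+")[[1]]
```

Note that the result is always a list, even for a single input string, which is why the last line indexes with `[[1]]`.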

apply – a most useful family of functions

This is a very important family of functions in a vectorized language like R. Its members are apply, lapply, sapply, mapply, and tapply. I use lapply and sapply most often, although for a matrix, apply is more suitable. These functions work well for complicated iterative calculations and are often faster and more concise than loops, which appear in comparison,…
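A brief sketch of what each member of the family does, using small made-up inputs:

```r
m <- matrix(1:6, nrow = 2)        # 2 x 3 matrix, filled column by column

# apply: over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
row_sums <- apply(m, 1, sum)
col_max  <- apply(m, 2, max)

# lapply always returns a list; sapply simplifies to a vector when it can
squares_list <- lapply(1:3, function(i) i^2)
squares_vec  <- sapply(1:3, function(i) i^2)

# mapply: walk over several vectors in parallel
sums <- mapply(function(a, b) a + b, 1:3, 4:6)

# tapply: apply a function within groups defined by a second vector
group_means <- tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), mean)
```

The main practical difference between `lapply` and `sapply` is the return type, so `lapply` is the safer choice when the results may differ in length.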

MLE – The Maximum Likelihood Estimate

Maximum likelihood estimation (MLE) is a general method for finding the model parameters under which the available data are most likely; it therefore addresses a central problem in data science. Depending on the model, the math behind MLE can be very complicated, but an intuitive way to think about it is through the following thought experiment…
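Separately from the thought experiment, a numerical sketch may help make the idea concrete. It assumes a normal model and simulated data, and it uses `optim` to minimize the negative log-likelihood; the model, the starting values, and the names `neg_loglik`, `mle`, and `closed` are illustrative choices, not taken from the post.

```r
set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)   # simulated data (an assumption)

# Negative log-likelihood of a normal model.
# par[1] is the mean; par[2] is log(sigma), so sigma stays positive.
neg_loglik <- function(par, data) {
  mu    <- par[1]
  sigma <- exp(par[2])
  -sum(dnorm(data, mean = mu, sd = sigma, log = TRUE))
}

# Minimizing the negative log-likelihood maximizes the likelihood
fit <- optim(c(0, 0), neg_loglik, data = x)
mle <- c(mu = fit$par[1], sigma = exp(fit$par[2]))

# For the normal model the MLE has a closed form: the sample mean and
# the standard deviation computed with divisor n (not n - 1)
closed <- c(mu = mean(x), sigma = sqrt(mean((x - mean(x))^2)))
```

Comparing `mle` with `closed` shows the numerical optimum landing close to the analytic answer, which is a useful sanity check before moving to models without a closed form.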