Simulated (toy) datasets are very helpful for testing data analysis tools and various other functions or transformations. For example, inserting random blanks (NAs) allows one to test imputation procedures. I have created a function to quickly insert NAs into a vector, which can be used across rows, columns, or on the whole data frame with one…
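The function itself is not shown in this excerpt, so the following is only a minimal sketch of the idea. The name `insert_nas`, its `prop` argument, and the toy data frame are all hypothetical.

```r
# Hypothetical helper (not the author's function): replace a given
# proportion of a vector's elements with NA at random positions.
insert_nas <- function(x, prop = 0.1) {
  n_na <- round(length(x) * prop)          # how many values to blank out
  x[sample.int(length(x), n_na)] <- NA     # pick positions without replacement
  x
}

set.seed(42)
toy <- data.frame(a = rnorm(20), b = runif(20), c = rpois(20, 3))

# Column-wise: every column loses 20% of its values
toy_na <- as.data.frame(lapply(toy, insert_nas, prop = 0.2))
colSums(is.na(toy_na))
```

Because the helper works on a single vector, the same call can be passed to `lapply` (columns) or `apply(toy, 1, ...)` (rows).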

## Power calculations

Underpowered studies are a big (but far from the only) source of the current replication crisis in the medical literature. Power calculations hinge on the expected effect size (often expressed as Cohen’s d), the populations’ spread around the mean (standard deviation) and arbitrary frequentist assumptions about alpha and beta. Cohen’s d is conceptually similar to…
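As a rough illustration of how these ingredients combine, base R's `power.t.test` solves for whichever quantity is left out. The values below (d = 0.5, alpha = 0.05, power = 0.80) are conventional placeholders chosen for this sketch, not figures from the post.

```r
# Sketch: sample size per group for a two-sample t-test.
# With sd = 1, delta equals Cohen's d (assumed here to be 0.5).
res <- power.t.test(delta = 0.5, sd = 1,
                    sig.level = 0.05,   # alpha
                    power = 0.80)       # 1 - beta
res$n            # required n per group (not necessarily an integer)
ceiling(res$n)   # round up to whole participants
```

Halving the expected effect size roughly quadruples the required sample, which is one reason optimistic effect estimates lead to underpowered studies.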

## The Reproducibility Crisis in Medicine

In 2005, Stanford epidemiologist John Ioannidis published the provocatively titled paper “Why most published research findings are false” (Ioannidis, PLoS Med 2005, 2:e124), which has since become a foundational piece of metascience. Among other things, he stated: The smaller the study sample conducted in a scientific field, the less likely the research findings are to…

## Does this claim pass the smell test?

It is hard to quickly evaluate data in an everyday situation, but a nifty shortcut I saw on the R-bloggers aggregator can help. In the simplest example, suppose you toss a coin 50 times with 32 heads and 18 tails. What are the chances the coin is fair? The handy shortcut helps to quickly evaluate…
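The shortcut itself is cut off in this excerpt. For comparison, the exact answer any shortcut would approximate can be obtained with base R's `binom.test`. Note that the resulting p-value is the probability of a split at least this lopsided if the coin were fair, which is not quite the same as the probability that the coin is fair.

```r
# Exact two-sided binomial test for 32 heads in 50 tosses of a
# supposedly fair coin (this is a reference check, not the shortcut).
res <- binom.test(x = 32, n = 50, p = 0.5)
res$p.value     # probability of a result this extreme under a fair coin
res$conf.int    # 95% confidence interval for the true proportion of heads
```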

## Bootstrapping

Bootstrapping (or ‘the bootstrap’) is a statistical technique of drawing repeated samples (resampling) with replacement from an available sample; this ultimately allows one to draw inferences about a population from the available sample. The number of resamples is usually large (say, 10,000), although with a representative sample, 50 resamples will get you there. This is…
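A minimal sketch of the resampling loop, using a made-up sample and the sample mean as the statistic of interest (both are assumptions for illustration):

```r
set.seed(1)
x <- rexp(30, rate = 1 / 5)   # invented sample of 30 observations

B <- 10000                    # number of resamples
# Each resample draws length(x) values from x with replacement
boot_means <- replicate(B, mean(sample(x, replace = TRUE)))

# Percentile 95% confidence interval for the mean
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```

Rerunning with a much smaller `B` makes it easy to see how stable the interval is for a given sample.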

## Strings and regex (regular expressions)

See Chapter 11 in The Art of R Programming, Chapter 7 in The R Cookbook, and Section 2.12 in The R Book (in particular, sections 2.12.5 – 2.12.13). There is a good discussion of regular expressions in sections 7.4-7.8 of Data Manipulation with R. Also see this example on how to melt the dataset. Set-up Split it! strsplit splits…
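A short sketch of `strsplit` together with two common pattern functions. The date strings are invented for illustration.

```r
x <- c("2021-03-15", "2022-11-02")   # made-up example strings

# strsplit returns a list with one character vector per input string
parts <- strsplit(x, "-", fixed = TRUE)

# Pull the first piece (the year) out of each list element
years <- sapply(parts, `[`, 1)

# grepl tests a regular expression against each string
starts_2021 <- grepl("^2021", x)

# gsub replaces every match; fixed = TRUE treats "-" literally
slashed <- gsub("-", "/", x, fixed = TRUE)
```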

## apply – a most useful family of functions

This is a very important family of functions in a vector-type language like R. Its members are apply, lapply, sapply, mapply, and tapply. I use lapply and sapply most often, although for a matrix, apply is more suitable. These functions work well for complicated iterative calculations and are MUCH faster than loops, which appear in comparison,…
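One small call per family member, on invented inputs, gives a feel for how they differ:

```r
m <- matrix(1:6, nrow = 2)            # 2 x 3 matrix, filled column-wise

row_sums <- apply(m, 1, sum)          # over rows:    9 12
col_sums <- apply(m, 2, sum)          # over columns: 3 7 11

lst <- list(a = 1:3, b = 4:6)
as_list <- lapply(lst, mean)          # always returns a list
as_vec  <- sapply(lst, mean)          # simplifies to a named vector: a = 2, b = 5

# mapply walks several vectors in parallel
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)   # 5 7 9

# tapply applies a function within groups
grp_sums <- tapply(c(10, 20, 30, 40), c("x", "y", "x", "y"), sum)  # x = 40, y = 60
```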

## The hypergeometric distribution

The hypergeometric distribution is used to solve the classic “balls in an urn” problem. Suppose one has 7 red balls and 3 white balls in an urn, and draws 2 balls. What is the probability that both balls are white? Let’s make a function to allow quick replication. Other combinations
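The post's own function is not included in this excerpt. A minimal sketch built on base R's `dhyper` could look like this, where the wrapper name `urn_prob` and its argument names are hypothetical:

```r
# Hypothetical wrapper around dhyper: probability of drawing exactly
# `hits` white balls in `draws` draws without replacement.
urn_prob <- function(hits, white, red, draws) {
  dhyper(x = hits, m = white, n = red, k = draws)
}

# 7 red and 3 white balls, 2 draws, both white
p_both_white <- urn_prob(hits = 2, white = 3, red = 7, draws = 2)

# The same value from first principles: (3/10) * (2/9) = 1/15
p_by_hand <- (3 / 10) * (2 / 9)
```

Other combinations only require changing the arguments, for example `urn_prob(1, 3, 7, 2)` for exactly one white ball.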

## MLE – The Maximum Likelihood Estimate

Maximum likelihood estimation (MLE) is a general method to find the parameter values of a model under which the available data are most likely; it therefore addresses a central problem in data science. Depending on the model, the math behind MLE can be very complicated, but an intuitive way to think about it is through the following thought experiment…
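The thought experiment is cut off here, so the sketch below uses an invented stand-in: a coin lands heads 7 times in 10 tosses, and we look for the head probability `p` that makes this outcome most likely.

```r
heads <- 7    # invented data: 7 heads ...
n     <- 10   # ... in 10 tosses

# Log-likelihood of the data as a function of the unknown probability p
loglik <- function(p) dbinom(heads, size = n, prob = p, log = TRUE)

# Numerically maximise over p; the interval avoids the endpoints 0 and 1
fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
fit$maximum   # close to heads / n, the analytical MLE for this model
```

For this simple model the answer is available in closed form, which makes it a convenient check that the numerical search behaves as expected.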

## The binomial distribution

The binomial distribution is used when there are n (a fixed number) independent trials with two possible outcomes (“success” and “failure”) with a probability that is constant. With 10 tosses of a fair coin, what is the probability of getting 7 heads? \[Prob = dbinom(7,10,0.5) = 0.1171875\] And what is the probability of getting exactly…
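The calculation above can be run directly. For contrast, the sketch also computes an "at least 7 heads" tail probability, which is an added illustration rather than part of the original text.

```r
# Probability of exactly 7 heads in 10 tosses of a fair coin
p_exact <- dbinom(7, size = 10, prob = 0.5)

# Probability of 7 or more heads: upper tail, P(X > 6)
p_at_least <- pbinom(6, size = 10, prob = 0.5, lower.tail = FALSE)

# The same tail obtained by summing the exact probabilities for 7..10
p_tail_sum <- sum(dbinom(7:10, size = 10, prob = 0.5))
```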