Strings and regex (regular expressions)

See Chapter 11 in The Art of R Programming, Chapter 7 in The R Cookbook, and Section 2.12 in The R Book (in particular, Sections 2.12.5–2.12.13). There is a good discussion of regular expressions in Sections 7.4–7.8 of Data Manipulation with R. Also see this example on how to melt the dataset. Set-up Split it! strsplit splits…
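As a minimal sketch of what strsplit does, the snippet below splits a string on a fixed separator and then on a regular expression. The input strings are invented for illustration and are not taken from the post's dataset.

```r
# Hypothetical input string, invented for illustration
x <- "alpha,beta,gamma"

# strsplit returns a list with one element per input string,
# so [[1]] extracts the character vector for the first (only) string
parts <- strsplit(x, ",")
first <- parts[[1]]   # "alpha" "beta" "gamma"

# The separator is treated as a regular expression by default:
# here the string is split on runs of one or more digits
y <- strsplit("a1b22c333d", "[0-9]+")[[1]]   # "a" "b" "c" "d"
```

Because the result is a list, sapply or lapply is a natural way to post-process it when the input is a vector of several strings.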

apply – a most useful family of functions

This is a very important family of functions in a vectorized language like R. Its members are apply, lapply, sapply, mapply, and tapply. I use lapply and sapply most often, although for a matrix, apply is more suitable. These functions work well for complicated iterative calculations and are MUCH faster than loops, which appear, in comparison,…
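A short sketch of the five family members on toy data may help; the matrix and vectors here are made up for illustration only.

```r
# Toy 2 x 3 matrix, filled column-wise: rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(1:6, nrow = 2)

# apply: work over the rows (MARGIN = 1) or columns (MARGIN = 2) of a matrix
row_sums <- apply(m, 1, sum)   # 9 12
col_sums <- apply(m, 2, sum)   # 3 7 11

# lapply always returns a list; sapply simplifies to a vector when it can
squares_list <- lapply(1:3, function(i) i^2)
squares_vec  <- sapply(1:3, function(i) i^2)   # 1 4 9

# mapply: walk over several vectors in parallel
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)   # 5 7 9

# tapply: apply a function within groups defined by a factor
group_sums <- tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), sum)   # a = 3, b = 7
```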

The hypergeometric distribution

The hypergeometric distribution is used to solve the classic “balls in an urn” problem. Suppose one has 7 red balls and 3 white balls in an urn, and draws 2 balls. What is the probability that both balls are white? Let’s make a function to allow quick replication. Other combinations…
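One possible way to compute the urn example is with R's built-in dhyper. The wrapper name urn_prob and its argument names are hypothetical, chosen here for readability; they are not necessarily what the post itself uses.

```r
# Hypothetical helper: probability of drawing exactly k_white white balls
# when n_drawn balls are taken without replacement from an urn
# containing n_white white and n_red red balls
urn_prob <- function(k_white, n_white, n_red, n_drawn) {
  dhyper(k_white, n_white, n_red, n_drawn)
}

# 3 white and 7 red balls, draw 2, both white
p_both_white <- urn_prob(2, 3, 7, 2)

# The same value from the counting formula:
# choose(3, 2) * choose(7, 0) / choose(10, 2) = 3 / 45 = 1 / 15
by_hand <- choose(3, 2) * choose(7, 0) / choose(10, 2)
```

Both values equal 1/15, or roughly 0.067.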

MLE – The Maximum Likelihood Estimate

Maximum likelihood estimation (MLE) is a general method for finding the parameter values under which a model most likely produced the available data; it therefore addresses a central problem in data science. Depending on the model, the math behind MLE can be very complicated, but an intuitive way to think about it is through the following thought experiment…
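To make the idea concrete, here is a minimal numerical sketch. It is not the thought experiment from the post: it assumes a binomial model and invented data (7 heads in 10 tosses), then searches for the success probability that maximizes the log-likelihood.

```r
# Invented data for illustration: 7 heads observed in 10 coin tosses
heads  <- 7
tosses <- 10

# Log-likelihood of the data as a function of the unknown probability p
loglik <- function(p) dbinom(heads, size = tosses, prob = p, log = TRUE)

# One-dimensional numerical search for the maximum over (0, 1)
fit   <- optimize(loglik, interval = c(0, 1), maximum = TRUE)
p_hat <- fit$maximum   # close to the analytic MLE, heads / tosses = 0.7
```

For the binomial model the maximum can be found analytically, so the numerical search is only a check; for more complicated models, a numerical optimizer of this kind is often the practical route.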

The binomial distribution

The binomial distribution is used when there is a fixed number n of independent trials, each with two possible outcomes (“success” and “failure”) and a constant probability of success. With 10 tosses of a fair coin, what is the probability of getting exactly 7 heads? \[\text{Prob} = \texttt{dbinom(7, 10, 0.5)} = 0.1171875\] And what is the probability of getting exactly…
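The coin-toss value can be checked against the binomial formula directly. This sketch uses only base R; the variable names are chosen here for illustration.

```r
# Probability of exactly 7 heads in 10 tosses of a fair coin
p_dbinom <- dbinom(7, size = 10, prob = 0.5)

# Same value by hand: choose(10, 7) * 0.5^7 * 0.5^3 = 120 / 1024
p_by_hand <- choose(10, 7) * 0.5^7 * 0.5^3

# Cumulative variant: probability of 7 or more heads,
# i.e. 1 - P(X <= 6), using the upper tail of pbinom
p_at_least_7 <- pbinom(6, size = 10, prob = 0.5, lower.tail = FALSE)
```

Both p_dbinom and p_by_hand equal 0.1171875.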

Fisher’s exact test

Fisher’s exact test is used to compare counts and proportions between groups when small samples of nominal variables are available. It assumes that the individual observations are independent, and that the row and column totals are fixed, or “conditioned.” An example would be putting 12 female hermit crabs and 9 male hermit crabs in an…
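As a rough sketch of how such a test is run in R, the code below builds a 2 x 2 table and passes it to fisher.test. Only the row totals (12 females, 9 males) come from the text; the column labels and the cell counts are invented for illustration and are not the post's data.

```r
# Hypothetical 2 x 2 table: rows sum to 12 females and 9 males;
# the outcome labels and cell counts are made up for this sketch
crabs <- matrix(c(9, 3,
                  2, 7),
                nrow = 2, byrow = TRUE,
                dimnames = list(sex     = c("female", "male"),
                                outcome = c("A", "B")))

# Fisher's exact test on the table (two-sided by default)
result  <- fisher.test(crabs)
p_value <- result$p.value
```

The returned object also carries an odds ratio estimate and its confidence interval (result$estimate and result$conf.int) for 2 x 2 tables.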

The Poisson distribution

A sports team scores 84 points in 21 games, so the average score is 4 points per game. What is the probability that it scores 1 point in a game? Or 6 points, or more than 4? The answer is in the Poisson distribution, where the variance equals the mean, denoted $\lambda$. The answer to the…
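The three questions above map onto base R's dpois and ppois. This is a minimal sketch with variable names chosen here for illustration.

```r
# Average score per game: 84 points over 21 games
lambda <- 84 / 21   # 4

# Probability of exactly 1 point and of exactly 6 points in a game
p_one <- dpois(1, lambda)   # about 0.073
p_six <- dpois(6, lambda)   # about 0.104

# Probability of more than 4 points: the upper tail P(X > 4)
p_more_than_4 <- ppois(4, lambda, lower.tail = FALSE)   # about 0.371
```

Note that ppois with lower.tail = FALSE gives P(X > 4), not P(X >= 4); for "4 or more" the first argument would be 3.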