Strings and regex (regular expressions)

See Chapter 11 in The Art of R Programming, Chapter 7 in The R Cookbook, and Section 2.12 in The R Book (in particular, Sections 2.12.5 – 2.12.13). Sections 7.4 – 7.8 of Data Manipulation with R give a good discussion of regular expressions. Also see this example on how to melt the dataset.

Set-up

library(stringr)
x <- "  This isn't my Sent$*&ence76 \nand this is a   newline (or so I am told), isn't it?            "

Split it!

strsplit splits a string at every match of its second argument and returns a list with one element per input string. The second argument is very powerful, as it can be a regex. Use unlist to flatten the result into a character vector.

y <- strsplit(x," "); y

## [[1]]
##  [1] ""              ""              "This"          "isn't"        
##  [5] "my"            "Sent$*&ence76" "\nand"         "this"         
##  [9] "is"            "a"             ""              ""             
## [13] "newline"       "(or"           "so"            "I"            
## [17] "am"            "told),"        "isn't"         "it?"          
## [21] ""              ""              ""              ""             
## [25] ""              ""              ""              ""             
## [29] ""              ""              ""

str(y)

## List of 1
##  $ : chr [1:31] "" "" "This" "isn't" ...

uyy <- unlist(y) ##simplifies the list into a vector of all its atomic components
uyy

##  [1] ""              ""              "This"          "isn't"        
##  [5] "my"            "Sent$*&ence76" "\nand"         "this"         
##  [9] "is"            "a"             ""              ""             
## [13] "newline"       "(or"           "so"            "I"            
## [17] "am"            "told),"        "isn't"         "it?"          
## [21] ""              ""              ""              ""             
## [25] ""              ""              ""              ""             
## [29] ""              ""              ""
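Splitting on a single space leaves empty strings wherever spaces are adjacent. A minimal sketch of using a regex as the split pattern instead; the string s is a made-up example, not the phrase used above.

```r
# Hypothetical example string with irregular spacing
s <- "  two  spaces   here "

# Splitting on a literal space keeps empty strings between adjacent spaces
by_space <- strsplit(s, " ")[[1]]

# Trim both ends first, then split on runs of one or more whitespace characters
by_regex <- strsplit(trimws(s), "\\s+")[[1]]
by_regex
```

With the regex pattern only the three words remain, so the empty strings no longer need to be filtered out afterwards.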

How many times was each character used in the phrase?

table(strsplit(x, character(0)))

## 
## \n     ,  ?  '  (  )  *  &  $  6  7  a  c  d  e  h  i  I  l  m  n  o  r  s  S 
##  1 31  1  1  2  1  1  1  1  1  1  1  3  1  2  5  2  7  1  2  2  7  3  1  6  1 
##  t  T  w  y 
##  6  1  1  1

How many words of each length are there in our phrase?

table(lapply(strsplit(x, " "), nchar))

## 
##  0  1  2  3  4  5  6  7 13 
## 15  2  4  2  3  2  1  1  1

Does the phrase contain a certain element?

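grep searches for a pattern. The name comes from the ed command g/re/p (globally search for a regular expression and print). grep returns the indices of the matching elements; grepl returns a logical vector.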

grep("s ", x)

## [1] 1

grepl("s ", x)  # logical: TRUE if the pattern is found

## [1] TRUE
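A few grep arguments that often come up, shown as a small sketch. The vector fruit is invented for illustration; value, ignore.case and fixed are standard arguments of grep and grepl.

```r
# Hypothetical vector of fruit names
fruit <- c("apple", "Banana", "cherry")

idx  <- grep("an", fruit)                       # indices of matching elements
vals <- grep("an", fruit, value = TRUE)         # the matching elements themselves
hasB <- grepl("b", fruit, ignore.case = TRUE)   # case-insensitive match

# fixed = TRUE treats the pattern as a literal string, so "." is just a dot
dotLiteral <- grepl(".", c("a.b", "ab"), fixed = TRUE)
```

Without fixed = TRUE the pattern "." is a regex that matches any character, so both test strings would match.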

How many times does it occur?

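When the target is a vector, grep returns the indices of all elements that contain the pattern, and length counts those elements. Note that this counts matching elements, not individual occurrences within an element.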

grep("s", uyy)

## [1]  3  4  8  9 15 19

length(grep("is", uyy))

## [1] 5
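To count every occurrence rather than every matching element, one option is gregexpr together with regmatches. A sketch on a made-up vector (words is not from the notes above):

```r
# Hypothetical vector in which the pattern can occur more than once per element
words <- c("banana", "bandana", "cabana")

# Number of elements that contain "an" at least once
nElements <- length(grep("an", words))

# gregexpr finds all matches; regmatches extracts them as a list of vectors
hits <- regmatches(words, gregexpr("an", words))

perElement <- lengths(hits)   # occurrences within each element
total <- sum(perElement)      # occurrences overall
```

Here nElements is 3 while total is 5, because "banana" and "bandana" each contain the pattern twice.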

Extract a fragment of the string

substr extracts the substring between a start and a stop position (both inclusive).

substr(x, 3, 7)

## [1] "This "
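Some further substr and substring uses, as a small sketch on an invented string: taking the last few characters, extracting several windows at once, and replacing in place.

```r
# Hypothetical example string
s <- "This sentence"

# Last four characters: count back from nchar(s)
lastFour <- substr(s, nchar(s) - 3, nchar(s))

# substring is vectorised over its start and stop positions
windows <- substring("abcdef", 1:3, 3:5)

# substr can also be used on the left-hand side to overwrite characters
substr(s, 1, 4) <- "That"
s
```

The replacement form keeps the string length unchanged, so it only suits swaps of equal width.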

Format a string to a certain type of output, with sprintf

Examples are from ?sprintf

sprintf("%s is %f feet tall\n", "Sven", 7.1)      # OK

## [1] "Sven is 7.100000 feet tall\n"

try(sprintf("%s is %i feet tall\n", "Sven", 7.1)) # only integer-valued reals get coerced to integer.

## Error in sprintf("%s is %i feet tall\n", "Sven", 7.1) : 
##   invalid format '%i'; use format %f, %e, %g or %a for numeric objects

sprintf("%s is %i feet tall\n", "Sven", 7  )  # OK

## [1] "Sven is 7 feet tall\n"

sprintf("%1.f",101)

## [1] "101"

## More sophisticated:
sprintf("min 10-char string '%10s'",
        c("a", "ABC", "and an even longer one"))

## [1] "min 10-char string '         a'"            
## [2] "min 10-char string '       ABC'"            
## [3] "min 10-char string 'and an even longer one'"

n <- 1:10
sprintf(paste("e with %2d digits = %.",n,"g",sep=""), n, exp(1))

##  [1] "e with  1 digits = 3"           "e with  2 digits = 2.7"        
##  [3] "e with  3 digits = 2.72"        "e with  4 digits = 2.718"      
##  [5] "e with  5 digits = 2.7183"      "e with  6 digits = 2.71828"    
##  [7] "e with  7 digits = 2.718282"    "e with  8 digits = 2.7182818"  
##  [9] "e with  9 digits = 2.71828183"  "e with 10 digits = 2.718281828"

sprintf("%s %d", "test", 1:3) ## re-cycle arguments

## [1] "test 1" "test 2" "test 3"
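Beyond the ?sprintf examples, width and padding flags are handy for building aligned or sortable labels. A brief sketch; the file name pattern and the values are made up.

```r
# Hypothetical ids; integer-valued doubles are accepted by %d
ids <- c(1, 10, 100)

# %03d pads with zeros to width 3, so the names sort correctly as text
fileNames <- sprintf("file_%03d.csv", ids)

# %5.1f: total width 5, one decimal place; %% prints a literal percent sign
pct <- sprintf("%5.1f%%", 12.345)

# A minus flag left-aligns within the field width
leftAligned <- sprintf("%-6s|", "ab")
```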

Regular expressions, a VERY brief introduction

Modifiers: see Table 7.1 in “Data Manipulation with R”.

Find the position of the first occurrence using regexpr:

uyy

##  [1] ""              ""              "This"          "isn't"        
##  [5] "my"            "Sent$*&ence76" "\nand"         "this"         
##  [9] "is"            "a"             ""              ""             
## [13] "newline"       "(or"           "so"            "I"            
## [17] "am"            "told),"        "isn't"         "it?"          
## [21] ""              ""              ""              ""             
## [25] ""              ""              ""              ""             
## [29] ""              ""              ""

str(regexpr("is", uyy)) ## character position of the first occurrence in each element; -1 means no match

##  int [1:31] -1 -1 3 1 -1 -1 -1 3 1 -1 ...
##  - attr(*, "match.length")= int [1:31] -1 -1 2 2 -1 -1 -1 2 2 -1 ...
##  - attr(*, "index.type")= chr "chars"
##  - attr(*, "useBytes")= logi TRUE

#str(gregexpr("is", uyy)) # all occurrences
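The positions returned by regexpr are usually passed on to regmatches to pull out the matched text. A minimal sketch with an invented vector s:

```r
# Hypothetical strings, one of them without any digits
s <- c("order 66", "no digits", "room 101b")

# First run of digits in each element; -1 marks elements with no match
m <- regexpr("[0-9]+", s)
starts <- as.integer(m)

# regmatches returns the matched text and silently drops non-matching elements
firstMatch <- regmatches(s, m)
```

Because non-matching elements are dropped, firstMatch can be shorter than s; keep starts around when the alignment with the input matters.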

Replace a fragment inside the string

gsub("my", "your", x) # replaces all occurrences, whereas sub replaces only the first

## [1] "  This isn't your Sent$*&ence76 \nand this is a   newline (or so I am told), isn't it?            "

gsub("\\s{2,}", " ", x, perl = TRUE) # collapses runs of 2 or more whitespace characters (spaces, tabs, newlines)

## [1] " This isn't my Sent$*&ence76 and this is a newline (or so I am told), isn't it? "

Look up the ?regex page for more info. Square brackets define a character class, which matches any single character listed inside it, for example [Sa-z:$,] or [a-zA-Z ]. Round brackets () group patterns together, and alternatives can be combined with “|”, the OR operator.
The start of a string is ^ and the end is $.
Repetition: {4,} is 4 or more, and {0,4} is up to 4 (the forms documented in ?regex are {n}, {n,} and {n,m}). When you have a complicated string structure, split it, then process the fragments:

# dfSold$address <- strsplit(as.character(dfSold$address), split=",")
# dfSold$zip <- sapply (dfSold$address,  function(x) gsub("AZ|\\s", "", x[3]))
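The same split-then-process idea on a single string, as a runnable sketch. The address is invented, and it assumes the "street, city, state zip" layout that the commented code above implies.

```r
# Hypothetical address in "street, city, state zip" form
address <- "123 Main St, Phoenix, AZ 85001"

# Split into fragments at the commas
parts <- strsplit(address, split = ",")[[1]]

# Third fragment is " AZ 85001": strip the state code and all whitespace
zip <- gsub("AZ|\\s", "", parts[3])

# Second fragment is " Phoenix": only the surrounding whitespace has to go
city <- trimws(parts[2])
```

The pattern "AZ|\\s" only handles Arizona; other states would need a more general pattern such as two capital letters.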

Eliminate extra whitespace (str_squish trims both ends and collapses internal runs of whitespace to a single space):

looseText <- "Whate    ever text     and    then add a tab here.     "
trimmedText <- looseText %>% str_squish(); trimmedText

## [1] "Whate ever text and then add a tab here."

Remove trailing blanks:

gsub(" *$", "", x) # " *" means "repeat space zero or more times"; $ anchors the match at the end of the string

## [1] "  This isn't my Sent$*&ence76 \nand this is a   newline (or so I am told), isn't it?"
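For comparison, a base-R sketch of the whitespace clean-ups in this section without stringr. The string s is made up and contains a real tab character.

```r
# Hypothetical string with leading, trailing and internal whitespace (including a tab)
s <- "  lots   of \t space  "

leadingGone  <- sub("^\\s+", "", s)   # remove whitespace at the start only
trailingGone <- sub("\\s+$", "", s)   # remove whitespace at the end only
bothGone     <- trimws(s)             # remove both ends, keep internal runs

# Trim, then collapse every internal run to one space (similar in effect to str_squish)
squished <- gsub("\\s+", " ", trimws(s))
```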

Remove parentheses and their content:

gsub("\\(.*\\)", "", x) # ".*" repeats dot (= any char) zero or more times, greedily, from the first "(" to the last ")"

## [1] "  This isn't my Sent$*&ence76 \nand this is a   newline , isn't it?            "
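The greedy .* matters as soon as a string has more than one pair of parentheses. A short sketch on an invented string showing the difference:

```r
# Hypothetical string with two parenthesised groups
s <- "a (b) c (d) e"

# Greedy: matches from the first "(" to the last ")", removing " c " as well
greedy <- gsub("\\(.*\\)", "", s)

# [^)]* stops at the first closing parenthesis, so each group is removed separately
perGroup <- gsub("\\([^)]*\\)", "", s)
```

The first call leaves "a  e" while the second leaves "a  c  e", which is usually what is wanted.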

Extract content within parentheses:

pos <- regexpr("\\(.*\\)", x)
pos

## [1] 56
## attr(,"match.length")
## [1] 17
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

substring(x, first = pos + 1, last = pos + attr(pos, "match.length") - 2) # skip the "(" and ")" themselves

## [1] "or so I am told"
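The position arithmetic can be avoided with a capture group, or with regmatches when the parentheses should be kept. A sketch under the assumption that the string holds exactly one parenthesised group; s is a shortened, made-up variant of the phrase.

```r
# Hypothetical string with a single parenthesised group
s <- "a newline (or so I am told), isn't it?"

# The inner () captures the text between the parentheses; \\1 refers to that capture
inside <- sub("^.*\\(([^)]*)\\).*$", "\\1", s)

# regmatches returns the whole match, parentheses included
m <- regexpr("\\([^)]*\\)", s)
withParens <- regmatches(s, m)
```

If the string has no parentheses, sub returns it unchanged, so it is worth checking with grepl first.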

Split up using a separator

separator <- " "
splitList <- str_split(trimmedText, separator); splitList ## yields a list

## [[1]]
## [1] "Whate" "ever"  "text"  "and"   "then"  "add"   "a"     "tab"   "here."

splitVector <- unlist(splitList); splitVector

## [1] "Whate" "ever"  "text"  "and"   "then"  "add"   "a"     "tab"   "here."
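Separators such as "." or "|" are regex metacharacters, so they need to be escaped or matched literally. A base-R sketch with invented inputs:

```r
# Hypothetical version string and pipe-delimited record
version <- "3.6.1"
record  <- "id|name|zip"

# fixed = TRUE treats the separator as a literal string
byDot <- strsplit(version, ".", fixed = TRUE)[[1]]

# Equivalent regex form: escape the dot
byDotRegex <- strsplit(version, "\\.")[[1]]

# "|" means OR in a regex, so match it literally as well
fields <- strsplit(record, "|", fixed = TRUE)[[1]]
```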

For other extract examples see ‘Tagging’ (p. 98) in Data Manipulation with R

Regex patterns

# "^[a-zA-Z]+\\.jpg$" matches a .jpg filename consisting only of letters
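A quick check of that pattern with grepl, as a sketch; the file names are made up to exercise the anchors, the character class and the escaped dot.

```r
# Hypothetical file names
files <- c("holiday.jpg", "img01.jpg", "notes.jpg.txt", "Cat.JPG")

# ^ and $ force the whole name to match; [a-zA-Z]+ allows letters only
isLetterJpg <- grepl("^[a-zA-Z]+\\.jpg$", files)

# ignore.case = TRUE also accepts an upper-case extension
isAnyCaseJpg <- grepl("^[a-zA-Z]+\\.jpg$", files, ignore.case = TRUE)
```

Only "holiday.jpg" passes the strict test: "img01.jpg" contains digits, "notes.jpg.txt" does not end in .jpg, and "Cat.JPG" differs in case.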

Dates

x <- c("3/2/2012", "11/30/2013")
as.Date(x, format = "%m/%d/%Y") # %m is the numeric month, %d the day, %Y the 4-digit year (%b is the 3-letter month abbreviation, %y the 2-digit year)

## [1] "2012-03-02" "2013-11-30"
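Once parsed, the same format codes work in the other direction with format. A small sketch using the two dates above; the invalid date string is invented. %b is left out because month names depend on the locale.

```r
# Parse the same two dates as above
d <- as.Date(c("3/2/2012", "11/30/2013"), format = "%m/%d/%Y")

twoDigitYear <- format(d, "%y")         # 2-digit year
iso          <- format(d, "%Y-%m-%d")   # ISO 8601 form

# A string that does not fit the format gives NA instead of an error
bad <- as.Date("13/45/2012", format = "%m/%d/%Y")

# Subtracting Date objects gives a difference in days
gap <- as.numeric(d[2] - d[1])
```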