See Chapter 11 of The Art of R Programming, Chapter 7 of The R Cookbook, and Section 2.12 of The R Book (in particular, Sections 2.12.5 – 2.12.13). Sections 7.4-7.8 of Data Manipulation with R give a good discussion of regular expressions. Also see this example on how to melt the dataset.
Set-up
library(stringr)
x <- "  This isn't my Sent$*&ence76 \nand this is a   newline (or so I am told), isn't it?            "
Split it!
strsplit splits a string at every match of its second argument and returns a list with one element per input string. The second argument is very powerful, as it can be a regex. Use unlist to flatten the result into a character vector.
y <- strsplit(x," "); y
## [[1]]
## [1] "" "" "This" "isn't"
## [5] "my" "Sent$*&ence76" "\nand" "this"
## [9] "is" "a" "" ""
## [13] "newline" "(or" "so" "I"
## [17] "am" "told)," "isn't" "it?"
## [21] "" "" "" ""
## [25] "" "" "" ""
## [29] "" "" ""
str(y)
## List of 1
## $ : chr [1:31] "" "" "This" "isn't" ...
uyy <- unlist(y) ##simplifies the list into a vector of all its atomic components
uyy
## [1] "" "" "This" "isn't"
## [5] "my" "Sent$*&ence76" "\nand" "this"
## [9] "is" "a" "" ""
## [13] "newline" "(or" "so" "I"
## [17] "am" "told)," "isn't" "it?"
## [21] "" "" "" ""
## [25] "" "" "" ""
## [29] "" "" ""
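Because the split pattern can be a regex, the empty strings above can be avoided by splitting on runs of whitespace. A minimal sketch, using a made-up string (messy is not part of the example above):

```r
# Illustrative string with leading, repeated and trailing whitespace, plus a tab
messy <- "  one two   three\tfour  "

# "\\s+" matches one or more whitespace characters (space, tab, newline).
# trimws() removes the leading whitespace first; otherwise the result
# would start with an empty string.
parts <- strsplit(trimws(messy), "\\s+")[[1]]
parts
```

Taking [[1]] here plays the same role as unlist, since the input is a single string.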
How many times was each character used in the phrase?
table(strsplit(x, character(0)))
##
## \n , ? ' ( ) * & $ 6 7 a c d e h i I l m n o r s S
## 1 31 1 1 2 1 1 1 1 1 1 1 3 1 2 5 2 7 1 2 2 7 3 1 6 1
## t T w y
## 6 1 1 1
How many words of each length are there in our phrase?
table(lapply(strsplit(x, " "), nchar))
##
## 0 1 2 3 4 5 6 7 13
## 15 2 4 2 3 2 1 1 1
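The count for length 0 comes from the empty strings that strsplit produces between consecutive spaces. A small sketch, on a made-up phrase, of dropping them before tabulating:

```r
# Illustrative phrase with a double space in the middle
phrase <- "a bb  ccc bb"

words <- strsplit(phrase, " ")[[1]]
words <- words[nzchar(words)]   # keep only non-empty strings

lens <- table(nchar(words))     # number of words of each length
lens
```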
Does the phrase contain a certain element?
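grep (the name comes from the ed command g/re/p, "global regular expression print") returns the indices of the elements that match; grepl returns a logical vector instead.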
grep("s ", x)
## [1] 1
grepl("s ", x) # returns TRUE/FALSE instead of an index
## [1] TRUE
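A few commonly used grep arguments, sketched on a made-up vector (fruits is only for illustration):

```r
fruits <- c("apple", "Banana", "cherry (sour)", "date")

idx      <- grep("an", fruits)                        # index of matching elements
vals     <- grep("an", fruits, value = TRUE)          # the matching elements themselves
starts_b <- grepl("^b", fruits, ignore.case = TRUE)   # case-insensitive match
has_par  <- grepl("(", fruits, fixed = TRUE)          # fixed = TRUE: treat "(" literally, not as regex
```

With fixed = TRUE the pattern is matched as plain text, so special characters such as ( need no escaping.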
How many times does it occur?
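When the target is a vector, grep returns the indices of all matching elements, and length of that result counts the matching elements (not the total number of occurrences inside them).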
grep("s", uyy)
## [1] 3 4 8 9 15 19
length(grep("is", uyy))
## [1] 5
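To count every occurrence inside a single string, rather than the number of matching elements, one option is gregexpr. A minimal sketch on a made-up sentence:

```r
sentence <- "this is his thesis"

# gregexpr returns the start positions of all matches, or -1 if there are none
hits   <- gregexpr("is", sentence)[[1]]
n_hits <- sum(hits > 0)   # counts matches; gives 0 when the only entry is -1
n_hits
```

stringr's str_count should give the same count in one call.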
Extract a fragment of the string
substr extracts the substring between a start and a stop position (both inclusive)
substr(x, 3, 7)
## [1] "This "
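substr can also be used on the left-hand side of an assignment, and substring is vectorised over the start and stop positions. A short sketch with a made-up code string:

```r
code <- "AZ-85001-USA"

zip   <- substr(code, 4, 8)                         # characters 4 to 8
parts <- substring(code, c(1, 4, 10), c(2, 8, 12))  # several fragments at once

substr(code, 1, 2) <- "NM"                          # replace characters 1 to 2 in place
code
```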
Format a string to a certain type of output, with sprintf
Examples are from ?sprintf
sprintf("%s is %f feet tall\n", "Sven", 7.1) # OK
## [1] "Sven is 7.100000 feet tall\n"
try(sprintf("%s is %i feet tall\n", "Sven", 7.1)) # only integer-valued reals get coerced to integer.
## Error in sprintf("%s is %i feet tall\n", "Sven", 7.1) :
## invalid format '%i'; use format %f, %e, %g or %a for numeric objects
sprintf("%s is %i feet tall\n", "Sven", 7 ) # OK
## [1] "Sven is 7 feet tall\n"
sprintf("%1.f",101)
## [1] "101"
## More sophisticated:
sprintf("min 10-char string '%10s'",
c("a", "ABC", "and an even longer one"))
## [1] "min 10-char string '         a'"
## [2] "min 10-char string '       ABC'"
## [3] "min 10-char string 'and an even longer one'"
n <- 1:10
sprintf(paste("e with %2d digits = %.",n,"g",sep=""), n, exp(1))
## [1] "e with  1 digits = 3"           "e with  2 digits = 2.7"
## [3] "e with  3 digits = 2.72"        "e with  4 digits = 2.718"
## [5] "e with  5 digits = 2.7183"      "e with  6 digits = 2.71828"
## [7] "e with  7 digits = 2.718282"    "e with  8 digits = 2.7182818"
## [9] "e with  9 digits = 2.71828183"  "e with 10 digits = 2.718281828"
sprintf("%s %d", "test", 1:3) ## re-cycle arguments
## [1] "test 1" "test 2" "test 3"
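A few more format specifications that come up often (zero padding, fixed decimals, left justification, a literal percent sign). These are illustrative calls, not taken from ?sprintf:

```r
padded  <- sprintf("%05d", 42L)        # zero-padded to width 5
rounded <- sprintf("%.2f", pi)         # two digits after the decimal point
left    <- sprintf("%-8s|", "left")    # left-justified, padded to width 8
pct     <- sprintf("%5.1f%%", 12.345)  # width 5, one decimal, then a literal %
c(padded, rounded, left, pct)
```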
Regular expressions, a VERY brief introduction
Find the position of the first occurrence using regexpr
uyy
## [1] "" "" "This" "isn't"
## [5] "my" "Sent$*&ence76" "\nand" "this"
## [9] "is" "a" "" ""
## [13] "newline" "(or" "so" "I"
## [17] "am" "told)," "isn't" "it?"
## [21] "" "" "" ""
## [25] "" "" "" ""
## [29] "" "" ""
str(regexpr("is", uyy)) ## character position of the first occurrence in each element (-1 if none)
## int [1:31] -1 -1 3 1 -1 -1 -1 3 1 -1 ...
## - attr(*, "match.length")= int [1:31] -1 -1 2 2 -1 -1 -1 2 2 -1 ...
## - attr(*, "index.type")= chr "chars"
## - attr(*, "useBytes")= logi TRUE
#str(gregexpr("is", uyy)) # all occurrences
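gregexpr returns a list with one vector of positions per element, and regmatches pulls out the matched text. A minimal sketch on a made-up vector:

```r
words <- c("This", "isn't", "my", "Mississippi")

m       <- gregexpr("is", words)       # all match positions, one list entry per element
matched <- regmatches(words, m)        # the matched substrings
counts  <- lengths(matched)            # number of matches per element
starts  <- as.integer(m[[4]])          # start positions within "Mississippi"
counts
```

Elements without a match get -1 in m and an empty character vector in matched.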
Replace a fragment inside the string
gsub("my", "your", x) # replaces all occurrences, whereas "sub" replaces only the first
## [1] "  This isn't your Sent$*&ence76 \nand this is a   newline (or so I am told), isn't it?            "
gsub("\\s{2,}", " ", x, perl = TRUE) # collapses runs of 2 or more whitespace characters (spaces, newlines) into one space
## [1] " This isn't my Sent$*&ence76 and this is a newline (or so I am told), isn't it? "
Look up the ?regex page for more info. Square brackets define a character class: [Sa-z:$,] or [a-zA-Z ] matches any single character from the listed set. Round brackets () group patterns together, and “|” is the OR operator, so groups can be combined as alternatives.
The start of a string is ^ and end is $.
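A small sketch of character classes, grouping with alternation, and anchors, tried on a made-up vector (ids is only for illustration):

```r
ids <- c("cat", "cart", "dog", "catalog", "Cat!")

exact   <- grepl("^cat$", ids)          # whole string must be "cat"
prefix  <- grepl("^(cat|dog)", ids)     # starts with "cat" or "dog"
punct   <- grepl("[!?]$", ids)          # ends with "!" or "?"
letters_only <- grepl("^[Cc]a[a-z]*$", ids)  # "Ca"/"ca" followed by lowercase letters only
rbind(exact, prefix, punct, letters_only)
```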
Repetition: {4,} is 4 or more, and {0,4} is up to 4 (the shorthand {,4} is not documented in ?regex and is not portable across regex engines). When you have a complicated string structure, split it, then process the fragments:
# dfSold$address <- strsplit(as.character(dfSold$address), split=",")
# dfSold$zip <- sapply (dfSold$address, function(x) gsub("AZ|\\s", "", x[3]))
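A self-contained version of the same split-then-process idea. The addresses and their "street, city, STATE zip" layout are assumptions made up for this sketch; dfSold itself is not reproduced here:

```r
# Hypothetical addresses in the form "street, city, STATE zip"
addresses <- c("12 Main St, Tempe, AZ 85281", "9 Oak Ave, Mesa, AZ 85201")

pieces <- strsplit(addresses, split = ",")   # list: one character vector per address

# Third fragment is " AZ 85281"; strip the state code and all whitespace
zip  <- sapply(pieces, function(p) gsub("AZ|\\s", "", p[3]))
# Second fragment is the city, with a leading space to trim
city <- sapply(pieces, function(p) trimws(p[2]))

# Sanity check with a repetition count: exactly five digits
all(grepl("^[0-9]{5}$", zip))
```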
Eliminate whitespace:
looseText <- "Whate ever text and then add a tab here. "
# str_squish() trims both ends and collapses internal runs of whitespace into a single space
trimmedText <- looseText %>% str_squish(); trimmedText
## [1] "Whate ever text and then add a tab here."
Remove trailing blanks:
gsub(" *$", "", x) # " *" means "repeat space zero or more times"; "$" anchors the match at the end of the string
## [1] "  This isn't my Sent$*&ence76 \nand this is a   newline (or so I am told), isn't it?"
Remove brackets:
gsub("\\(.*\\)", "", x) # "repeat dot (=any char) zero or more times"; .* is greedy, so it runs to the last ")"
## [1] "  This isn't my Sent$*&ence76 \nand this is a   newline , isn't it?            "
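Because .* is greedy, a string with several bracketed parts loses everything between the first ( and the last ). A minimal sketch of the difference, on a made-up string:

```r
s <- "a(b)c(d)e"

greedy  <- gsub("\\(.*\\)", "", s)                 # one match: from first "(" to last ")"
lazy    <- gsub("\\(.*?\\)", "", s, perl = TRUE)   # .*? stops at the nearest ")"
negated <- gsub("\\([^)]*\\)", "", s)              # same idea without a lazy quantifier
c(greedy, lazy, negated)
```

The negated class [^)] means "any character except )", which keeps each match inside one pair of brackets.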
Extract content within brackets:
pos <- regexpr("\\(.*\\)", x)
pos
## [1] 56
## attr(,"match.length")
## [1] 17
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
substring (x, first=pos+1, last=pos+attr(pos,"match.length")-2)
## [1] "or so I am told"
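The same extraction can be sketched without position arithmetic, either with a capture group or with regmatches. The string note below is a shortened stand-in for x:

```r
note <- "a newline (or so I am told), isn't it?"

# Capture group: \\1 refers to whatever matched inside the inner ( )
inside1 <- sub(".*\\(([^)]*)\\).*", "\\1", note)

# regmatches returns the matched text including the brackets; strip them with substr
m       <- regmatches(note, regexpr("\\([^)]*\\)", note))
inside2 <- substr(m, 2, nchar(m) - 1)
c(inside1, inside2)
```

Note that sub returns the input unchanged when nothing matches, so the capture-group form needs a check if brackets may be absent.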
Split up using a separator
separator <- " "
splitList <- str_split(trimmedText, separator); splitList ## yields a list
## [[1]]
## [1] "Whate" "ever" "text" "and" "then" "add" "a" "tab" "here."
splitVector <- unlist(splitList); splitVector
## [1] "Whate" "ever" "text" "and" "then" "add" "a" "tab" "here."
For other extraction examples, see ‘Tagging’ (p. 98) in Data Manipulation with R.
Regex patterns
# "^[a-zA-Z]+\\.jpg$" matches a .jpg filename consisting only of letters
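A quick check of a letters-only .jpg pattern, written as ^[a-zA-Z]+\\.jpg$, against a few made-up file names:

```r
files <- c("photo.jpg", "photo1.jpg", "photo.jpeg", "my photo.jpg", "photoXjpg")

ok <- grepl("^[a-zA-Z]+\\.jpg$", files)
ok

# Without escaping, "." matches any character, so "photoXjpg" slips through
unescaped <- grepl("^[a-zA-Z]+.jpg$", "photoXjpg")
unescaped
```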
Dates
x <- c("3/2/2012", "11/30/2013")
as.Date(x, format = "%m/%d/%Y") # %m is the numeric month, %d the day, %Y the 4-digit year; %b is the abbreviated month name, %y the 2-digit year
## [1] "2012-03-02" "2013-11-30"
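A short sketch of other format codes: parsing day-first dates with 2-digit years, printing dates in another layout, and what happens when a string does not fit the format. The input strings are made up; %b is left out because month names depend on the locale:

```r
# Day-month-year with a 2-digit year (%y)
d <- as.Date(c("02-03-12", "30-11-13"), format = "%d-%m-%y")
iso <- format(d)                    # default ISO layout, YYYY-MM-DD

# format() turns a Date back into text in any layout
slashed <- format(d, "%Y/%m/%d")

# A string that does not fit the format gives NA rather than an error
bad <- as.Date("2012-13-01", format = "%Y-%m-%d")
is.na(bad)
```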