stringr::str_match

Function of the Week

Author

Trevor Delsey

Published

March 6, 2024

1 `stringr::str_match`

In this document, I will introduce the str_match function and show what it’s for.

# load the tidyverse
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#example dataset
library(palmerpenguins)
data(penguins)

1.1 What is it for?

The function str_match takes a character vector and a regex pattern and returns the first match of the pattern in each element of the vector. Specifically, str_match returns a character matrix with one row per element: the first column holds the full match, any further columns hold capture groups, and elements with no match give NA. We will get to what that means later.

To explain this let’s remind ourselves what a character vector is:

character_vector <- c("This", "is an", "example")
typeof(character_vector)

[1] "character"
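
Before moving on, here is a minimal sketch of the shape str_match gives back for a character vector like this. The vector fruit_vec and the pattern are made up for illustration and are not part of the data used in the rest of this post.

```r
library(stringr)

# A small made-up character vector (hypothetical, for illustration only)
fruit_vec <- c("apple", "banana", "kiwi")

# str_match returns one row per input element, NA where there is no match
m <- str_match(fruit_vec, "an")

is.matrix(m)  # TRUE
dim(m)        # 3 1
m[, 1]        # NA "an" NA
```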

Next we need to know a few regular expression (regex) special characters. There are quite a number of them, so I will just define the ones I will be using here:

"" regular expressions are written inside quotation marks.
"[Aa]" matches either an upper case or a lower case "A".
"A" matches only an upper case "A".
"A+" matches one or more "A"s in a row.
"[A-Za-z]" matches any single letter, upper or lower case.
"." matches any single character (except a line break, by default).
"^a" matches an "a" at the start of the string.
"a$" matches an "a" at the end of the string.

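To make the special characters listed above a bit more concrete, here is a small sketch that tries each one on a made-up vector. The vector x is hypothetical, and I use str_detect and str_extract (two other stringr functions) only because their output is compact.

```r
library(stringr)

# Made-up test strings (hypothetical, not from the datasets used in this post)
x <- c("Apple", "banana", "AAA", "area")

str_detect(x, "[Aa]")        # TRUE TRUE TRUE TRUE
str_detect(x, "A")           # TRUE FALSE TRUE FALSE
str_extract(x, "A+")         # "A" NA "AAA" NA
str_extract(x, "[A-Za-z]")   # "A" "b" "A" "a"
str_extract(x, ".")          # "A" "b" "A" "a"
str_detect(x, "^a")          # FALSE FALSE FALSE TRUE
str_detect(x, "a$")          # FALSE TRUE FALSE TRUE
```
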
Here is the data we will be working with:

words_df <- as_tibble(words)
sentences_vec <- sentences

words %>% head(20)

 [1] "a"         "able"      "about"     "absolute"  "accept"    "account"  
 [7] "achieve"   "across"    "act"       "active"    "actual"    "add"      
[13] "address"   "admit"     "advertise" "affect"    "afford"    "after"    
[19] "afternoon" "again"

head(sentences_vec)

[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."       
[6] "The juice of lemons makes fine punch."

length(words)

[1] 980

Let’s try to find every observation in sentences_vec that contains the letters “th”:

# we can see that it returns only the matched portion
str_match(sentences_vec, "th")

       [,1]
  [1,] "th"
  [2,] "th"
  [3,] "th"
  [4,] NA  
  [5,] NA  
  [6,] NA  
  [7,] "th"
  [8,] NA  
  [9,] NA  
 [10,] NA  
 [11,] "th"
 [12,] NA  
 [13,] "th"
 [14,] "th"
 [15,] "th"
 [16,] "th"
 [17,] NA  
 [18,] "th"
 [19,] "th"
 [20,] "th"
 [ reached getOption("max.print") -- omitted 700 rows ]

# if we want the whole string that contains the th we can add a few more operators

# adding a . on each side matches th together with exactly one character before and after it
str_match(sentences_vec, ".th.")

       [,1]  
  [1,] " the"
  [2,] " the"
  [3,] " the"
  [4,] NA    
  [5,] NA    
  [6,] NA    
  [7,] " thr"
  [8,] NA    
  [9,] NA    
 [10,] NA    
 [11,] " the"
 [12,] NA    
 [13,] " the"
 [14,] " the"
 [15,] " the"
 [16,] " the"
 [17,] NA    
 [18,] " the"
 [19,] " the"
 [20,] " the"
 [ reached getOption("max.print") -- omitted 700 rows ]

# and if we add a + after each . we get the entire string (as long as th is not at the very start or end)
str_match(sentences_vec, ".+th.+")

       [,1]                                                       
  [1,] "The birch canoe slid on the smooth planks."               
  [2,] "Glue the sheet to the dark blue background."              
  [3,] "It's easy to tell the depth of a well."                   
  [4,] NA                                                         
  [5,] NA                                                         
  [6,] NA                                                         
  [7,] "The box was thrown beside the parked truck."              
  [8,] NA                                                         
  [9,] NA                                                         
 [10,] NA                                                         
 [11,] "The boy was there when the sun rose."                     
 [12,] NA                                                         
 [13,] "The source of the huge river is the clear spring."        
 [14,] "Kick the ball straight and follow through."               
 [15,] "Help the woman get back to her feet."                     
 [16,] "A pot of tea helps to pass the evening."                  
 [17,] NA                                                         
 [18,] "The soft cushion broke the man's fall."                   
 [19,] "The salt breeze came across from the sea."                
 [20,] "The girl at the booth sold fifty bonds."                  
 [ reached getOption("max.print") -- omitted 700 rows ]
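
One thing to watch with ".+th.+" is that the + needs at least one character on each side, so a string that starts or ends with "th" is not matched. The sketch below compares it with ".*th.*", where * allows zero or more characters; the * quantifier is not in my list above, and the test strings are made up.

```r
library(stringr)

# Made-up strings: "th" at the start, at the end, and in the middle (hypothetical)
x <- c("the end", "both", "a path here", "nothing")

# .+ requires at least one character before and after "th"
strict <- str_match(x, ".+th.+")[, 1]
strict  # NA NA "a path here" "nothing"

# .* also accepts nothing before or after "th"
loose <- str_match(x, ".*th.*")[, 1]
loose   # "the end" "both" "a path here" "nothing"
```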

1.2 Let’s look at some complications with str_match

# recall that words_df is a tibble rather than a vector 

# here we see that str_match does not like receiving an entire data frame
# even if that data frame only contains one column

words_df %>% 
  str_match(".+th.+")

Warning in stri_match_first_regex(string, pattern, opts_regex = opts(pattern)):
argument is not an atomic vector; coercing

     [,1]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
[1,] "c(\"a\", \"able\", \"about\", \"absolute\", \"accept\", \"account\", \"achieve\", \"across\", \"act\", \"active\", \"actual\", \"add\", \"address\", \"admit\", \"advertise\", \"affect\", \"afford\", \"after\", \"afternoon\", \"again\", \"against\", \"age\", \"agent\", \"ago\", \"agree\", \"air\", \"all\", \"allow\", \"almost\", \"along\", \"already\", \"alright\", \"also\", \"although\", \"always\", \"america\", \"amount\", \"and\", \"another\", \"answer\", \"any\", \"apart\", \"apparent\", \"appear\", \"apply\", \"appoint\", \"approach\", \"appropriate\", \"area\", \"argue\", \"arm\", \"around\", "

# and now it will work fine
words_df$value %>% 
  str_match(".+th.+")

       [,1]       
  [1,] NA         
  [2,] NA         
  [3,] NA         
  [4,] NA         
  [5,] NA         
  [6,] NA         
  [7,] NA         
  [8,] NA         
  [9,] NA         
 [10,] NA         
 [11,] NA         
 [12,] NA         
 [13,] NA         
 [14,] NA         
 [15,] NA         
 [16,] NA         
 [17,] NA         
 [18,] NA         
 [19,] NA         
 [20,] NA         
 [ reached getOption("max.print") -- omitted 960 rows ]
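
If you prefer to stay inside a pipe, dplyr's pull does the same job as the $ above. This is just a sketch on a tiny made-up tibble (demo_df is hypothetical) rather than the full words_df.

```r
library(dplyr)
library(stringr)

# A tiny stand-in for words_df (hypothetical data)
demo_df <- tibble(value = c("other", "add", "mother"))

# pull() extracts the column as a plain character vector before matching
m <- demo_df %>%
  pull(value) %>%
  str_match(".+th.+")

m[, 1]  # "other" NA "mother"
```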

Another problem you might encounter:

# Notice how the column name is not what I asked it to be

words_df %>% 
  mutate(the = str_match(value, ".+the.+")) %>% 
  drop_na(the)

# A tibble: 12 × 2
   value     the[,1]  
   <chr>     <chr>    
 1 another   another  
 2 bother    bother   
 3 brother   brother  
 4 either    either   
 5 father    father   
 6 further   further  
 7 mother    mother   
 8 other     other    
 9 otherwise otherwise
10 rather    rather   
11 together  together 
12 whether   whether

This is because the output of str_match is a matrix. Here the matrix has only a single column, so mutate can still store it as a new column, but it is stored as a matrix column and its name is printed as the[,1] rather than the.

To fix this issue we can use str_extract, a closely related function that returns a character vector of matches rather than a matrix.

# and we can see this fixes the issue

words_df %>% 
  mutate(the = str_extract(value, ".+the.+")) %>% 
  drop_na(the)

# A tibble: 12 × 2
   value     the      
   <chr>     <chr>    
 1 another   another  
 2 bother    bother   
 3 brother   brother  
 4 either    either   
 5 father    father   
 6 further   further  
 7 mother    mother   
 8 other     other    
 9 otherwise otherwise
10 rather    rather   
11 together  together 
12 whether   whether
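
If you would rather keep str_match (for example because you plan to add groups later), another option is to take the first column of the matrix with [, 1] so that mutate receives an ordinary vector. A minimal sketch on a made-up tibble (demo_df is hypothetical):

```r
library(dplyr)
library(stringr)

# A tiny stand-in for words_df (hypothetical data)
demo_df <- tibble(value = c("other", "add", "mother"))

# [, 1] keeps only the full-match column, so the new column is a plain vector
result <- demo_df %>%
  mutate(the = str_match(value, ".+the.+")[, 1]) %>%
  filter(!is.na(the))

names(result)          # "value" "the"
is.matrix(result$the)  # FALSE
```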

1.3 So why does str_match return a matrix at all?

str_match returns a matrix because a regex can contain capture groups (parts of the pattern wrapped in parentheses), and each group gets its own column. We won’t go into regex groups in depth because they can get fairly complicated, but here is an example of what the output looks like.

# Here I asked str_match to split the match into two groups (letters, then digits); column 1 is the full match and each group gets its own column

penguins_raw$studyName %>% 
  str_match("([A-Z]+)(\\d+)")

       [,1]      [,2]  [,3]  
  [1,] "PAL0708" "PAL" "0708"
  [2,] "PAL0708" "PAL" "0708"
  [3,] "PAL0708" "PAL" "0708"
  [4,] "PAL0708" "PAL" "0708"
  [5,] "PAL0708" "PAL" "0708"
  [6,] "PAL0708" "PAL" "0708"
 [ reached getOption("max.print") -- omitted 338 rows ]
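
Once the groups are in separate columns, they can be picked out by position and turned into ordinary columns. Here is a sketch using a short made-up vector in the same format as studyName; the column names prefix and seasons are my own labels, not something the data defines.

```r
library(stringr)
library(tibble)

# Values in the same format as studyName (typed in by hand for this sketch)
study <- c("PAL0708", "PAL0809", "PAL0910")

m <- str_match(study, "([A-Z]+)(\\d+)")

# Column 1 is the full match, columns 2 and 3 are the two groups
parts <- tibble(
  study_name = m[, 1],
  prefix     = m[, 2],
  seasons    = m[, 3]
)
parts
```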

1.4 str_match_all

Finally we have str_match_all, which returns every match of the pattern in each element rather than just the first. Because the number of matches can differ between elements, it returns a list with one matrix per element.

str_match_all(sentences, ".t.")[1]

[[1]]
     [,1] 
[1,] " th"
[2,] "oth"

# compare this with

str_match(sentences, ".t.")[1]

[1] " th"

# now let's look at the sentence itself to see the difference

sentences[1]

[1] "The birch canoe slid on the smooth planks."
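
To show the list structure a little more clearly, here is a sketch with a made-up vector and a pattern that has two groups. Each list element is a matrix with one row per match, and an element with no match gives a matrix with zero rows.

```r
library(stringr)

# Made-up strings with two, zero, and one match respectively (hypothetical)
x <- c("1-2 and 3-4", "none", "10-20")

all_matches <- str_match_all(x, "(\\d+)-(\\d+)")

# Number of matches found in each element
n_matches <- sapply(all_matches, nrow)
n_matches             # 2 0 1

# Second match in the first string: full match, then the two groups
all_matches[[1]][2, ] # "3-4" "3" "4"
```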

1.5 Is it helpful?

Yes, I believe it is very helpful but I also must admit it can feel very finicky to use. First you need to get the regex right. Then you need to deal with the output and input not always working nicely with dplyr.

I believe one of the big uses for str_match is for finding things like phone numbers in large character vectors.
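
As a rough sketch of the phone number idea, the pattern below looks for three digits, three digits, and four digits separated by a dash, space, or dot. The contacts vector and the pattern are made up for illustration, and real phone numbers come in many more formats than this pattern covers.

```r
library(stringr)

# Made-up text containing phone-like numbers (hypothetical data)
contacts <- c(
  "Call 503-555-0134 after 5pm",
  "no number here",
  "Office: 971 555 0199"
)

# Three groups: area code, exchange, line number
phone_pattern <- "([0-9]{3})[- .]([0-9]{3})[- .]([0-9]{4})"

m <- str_match(contacts, phone_pattern)
m[, 1]  # "503-555-0134" NA "971 555 0199"
m[, 2]  # "503" NA "971"
```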

Here is what I think may be a more real-world usage of str_match:

penguins_raw %>% 
  select(Comments)

# A tibble: 344 × 1
   Comments                             
   <chr>                                
 1 Not enough blood for isotopes.       
 2 <NA>                                 
 3 <NA>                                 
 4 Adult not sampled.                   
 5 <NA>                                 
 6 <NA>                                 
 7 Nest never observed with full clutch.
 8 Nest never observed with full clutch.
 9 No blood sample obtained.            
10 No blood sample obtained for sexing. 
# ℹ 334 more rows

# Looking at the penguins_raw data, there is a Comments variable. If I wanted to find all observations whose comment mentions a nest, I could do so like this:

penguins_raw %>%  
  mutate(match = str_match(Comments, pattern = ".*[Nn]est.*")) %>% 
  filter(!is.na(match)) %>% 
  select(match, everything())

Warning: Using one column matrices in `filter()` was deprecated in dplyr 1.1.0.
ℹ Please use one dimensional logical vectors instead.

# A tibble: 36 × 18
   match[,1]               studyName `Sample Number` Species Region Island Stage
   <chr>                   <chr>               <dbl> <chr>   <chr>  <chr>  <chr>
 1 Nest never observed wi… PAL0708                 7 Adelie… Anvers Torge… Adul…
 2 Nest never observed wi… PAL0708                 8 Adelie… Anvers Torge… Adul…
 3 Nest never observed wi… PAL0708                29 Adelie… Anvers Biscoe Adul…
 4 Nest never observed wi… PAL0708                30 Adelie… Anvers Biscoe Adul…
 5 Nest never observed wi… PAL0708                39 Adelie… Anvers Dream  Adul…
 6 Nest never observed wi… PAL0708                40 Adelie… Anvers Dream  Adul…
 7 Nest never observed wi… PAL0809                69 Adelie… Anvers Torge… Adul…
 8 Nest never observed wi… PAL0809                70 Adelie… Anvers Torge… Adul…
 9 Nest never observed wi… PAL0910               121 Adelie… Anvers Torge… Adul…
10 Nest never observed wi… PAL0910               122 Adelie… Anvers Torge… Adul…
# ℹ 26 more rows
# ℹ 11 more variables: `Individual ID` <chr>, `Clutch Completion` <chr>,
#   `Date Egg` <date>, `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
#   `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
#   `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
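
The deprecation warning above appears because filter receives a one-column matrix. If the goal is only to keep the rows that mention a nest, one way to sidestep it is str_detect, which returns a plain logical vector. This is a sketch on a small made-up tibble (comments_df is hypothetical; the last comment is invented) rather than the full penguins_raw data.

```r
library(dplyr)
library(stringr)

# A small stand-in for the Comments column (hypothetical data)
comments_df <- tibble(
  comment = c(
    "Nest never observed with full clutch.",
    NA,
    "No blood sample obtained.",
    "Abandoned nest."
  )
)

# str_detect gives TRUE/FALSE/NA; filter drops the FALSE and NA rows
nest_rows <- comments_df %>%
  filter(str_detect(comment, "[Nn]est"))

nrow(nest_rows)  # 2
```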