Web Scraping: A Primer

Quang Nguyen

Department of Statistics & Data Science
Carnegie Mellon University

SURE 2023 - CMSACamp

@qntkhvn       qntkhvn       qntkhvn.netlify.app

Slides: qntkhvn.github.io/webscraping

First, a moment of appreciation

  • Sports analytics today would look very different without web scraping

    • Publicly available data

    • Reproducible research

  • A breakthrough…

Overview

Goal: after this lecture, you should be able to use R to scrape an HTML table from the web (and hopefully more).

Agenda:

  • Webpage basics

  • Web scraping with rvest

    • Featuring stringr
  • APIs

  • Responsible web scraping

    • Best practices

    • Featuring polite

Material credits

I highly recommend the Web scraping chapter in R4DS (2e) for a neat basic overview of webpage structure and web scraping.

Webpage basics

  • HTML (HyperText Markup Language) defines the content and structure of a webpage

    • An HTML page contains various elements (headers, paragraphs, tables…)

    • HTML tags define where an element begins and ends

    • Opening and closing tags have the forms <tagname> and </tagname> (e.g., <table> and </table>)

  • CSS (Cascading Style Sheets) defines the appearance of HTML elements (i.e., whether the webpage is pretty or ugly)

    • CSS selectors are patterns used to select the elements to be styled

    • CSS selectors can also be used to extract elements from a webpage
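As a concrete illustration, a minimal (hypothetical, not from any real page) HTML table might look like the fragment below; the id attribute is what a CSS selector such as #stats can target:

```html
<table id="stats" class="stats_table">
  <thead>
    <tr><th>Rank</th><th>Player</th></tr>
  </thead>
  <tbody>
    <tr><td>1.</td><td>Patrick Marleau</td></tr>
  </tbody>
</table>
```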

Web scraping in R

Two widely-used packages:

  • rvest: simple, tidyverse friendly, static data

  • RSelenium: more advanced, dynamic data

We will focus on rvest today

install.packages("rvest")

(For the Python fans, you can do the same with Beautiful Soup and Selenium)

Typical web scraping workflow

  • (Survey the webpage)
  • Scrape the webpage content
  • Data organization and cleaning (most of the process)

    • Extracting elements (e.g., tables, links, etc.)

    • Common data manipulation tasks (e.g., with dplyr)

    • Handle strings (e.g., stringr, stringi) and regular expressions (regex)

  • Generalization: write functions (and develop packages)

    • For instance, in sports, you may be interested in data for not just one season/player/team/etc., but multiple

And make sure to…

  • Inspect the output at each step

  • Consult the help documentation
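In rvest terms, the workflow above typically follows the skeleton below (the URL and CSS selector here are placeholders, not a real page; the cleaning step is sketched as a comment):

```r
library(rvest)
library(dplyr)

# placeholder URL and CSS selector -- fill in for a real webpage
raw_tbl <- "https://example.com/stats" |>
  read_html() |>                        # scrape the webpage content
  html_element(css = "#some_table") |>  # extract the table element
  html_table()                          # convert to a tibble

# data cleaning step, e.g., with dplyr/stringr:
# clean_tbl <- raw_tbl |> mutate(...)
```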

Scraping HTML tables

  • Task: scrape the NHL Leaders table

    • Read the HTML page into R

    • Grab the CSS selector for the table in the browser

    • Write scraping code in R

    • Perform the following data cleaning steps

      • Create an indicator variable for whether a player is in the Hall of Fame

      • Remove the asterisk (*) from the Player column

      • Remove the dot (.) from the Rank column

Read the HTML page into R

  • We use read_html() to read in the HTML page based on a specified webpage’s URL
library(rvest)
library(tidyverse)
nhl_url <- "https://www.hockey-reference.com/leaders/games_played_career.html"
nhl_url |> 
  read_html()
{html_document}
<html data-version="klecko-" data-root="/home/hr/build" lang="en" class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="hr">\n<div id="wrap">\n  \n  <div id="header" role="banner"> ...
  • The next step is to extract the NHL Leaders table from the HTML. This can be accomplished by finding the CSS selector of the table.

How to find a table’s CSS selector

  1. Move the cursor close to the table you want to scrape (e.g., near the top of the table). Right click and select Inspect (Chrome or Firefox) or Inspect element (Safari)
  2. The Developer Tools will open in your browser. Pay attention to the Elements pane (Chrome or Safari) or Inspector pane (Firefox)

    • The HTML element corresponding to the webpage area (close to the table) mentioned in Step 1 is highlighted

    • Hovering over different HTML elements will highlight different parts of the webpage

  3. To find the CSS selector for the table, hover over different lines and stop at the line where only the entire table is highlighted

    • This will often be a line with a <table> opening tag
  4. Right click on the line, then choose Copy → Copy selector (Chrome) or Copy → CSS Selector (Firefox) or Copy → Selector Path (Safari)

Scraping HTML tables

The following video shows how to find the CSS selector (in Chrome)

Scraping HTML tables

  • Use html_element() to get the element associated with the CSS selector (table in this case)

  • Inside html_element(), specify the CSS selector that we copied earlier

  • This returns an HTML “node”

nhl_url |> 
  read_html() |> 
  html_element(css = "#stats_career_NHL")
{html_node}
<table class="suppress_glossary suppress_csv sortable stats_table" id="stats_career_NHL" data-cols-to-freeze="1,2">
[1] <caption>NHL Leaders Table</caption>
[2] <thead><tr>\n<th class="right">Rank</th>\n<th class="left">Player</th>\n< ...
[3] <tbody>\n<tr>\n<td class="right">1.</td>\n<td class="left"><a href="/play ...

Scraping HTML tables

  • Finally, use html_table() to convert to a tibble (data frame) in R

  • This completes our scraping process

nhl_tbl <- nhl_url |> 
  read_html() |> 
  html_element(css = "#stats_career_NHL") |> 
  html_table()

nhl_tbl
# A tibble: 250 × 4
   Rank  Player           Years      GP
   <chr> <chr>            <chr>   <int>
 1 1.    Patrick Marleau  1997-21  1779
 2 2.    Gordie Howe*     1946-80  1767
 3 3.    Mark Messier*    1979-04  1756
 4 4.    Jaromír Jágr     1990-18  1733
 5 5.    Ron Francis*     1981-04  1731
 6 6.    Joe Thornton     1997-22  1714
 7 7.    Zdeno Chára      1997-22  1680
 8 8.    Mark Recchi*     1988-11  1652
 9 9.    Chris Chelios*   1983-10  1651
10 10.   Dave Andreychuk* 1982-06  1639
# ℹ 240 more rows

Remarks: webpage elements

  • Note that html_table() only works when the element specified in html_element() is a table
  • There are other things that can be extracted from an element

    • To retrieve text from an element, use html_text2()

    • To retrieve an attribute (e.g., a hyperlink) from an element, use html_attr() and html_attrs()

    • We will touch on these two cases later on

  • The inspection step (for obtaining the CSS selector) can be skipped

  • html_table() can be called right after read_html()

  • This outputs a list of all the tables that exist on the webpage

nhl_tbl_list <- nhl_url |> 
  read_html() |> 
  html_table()
nhl_tbl_list
[[1]]
# A tibble: 250 × 4
   Rank  Player           Years      GP
   <chr> <chr>            <chr>   <int>
 1 1.    Patrick Marleau  1997-21  1779
 2 2.    Gordie Howe*     1946-80  1767
 3 3.    Mark Messier*    1979-04  1756
 4 4.    Jaromír Jágr     1990-18  1733
 5 5.    Ron Francis*     1981-04  1731
 6 6.    Joe Thornton     1997-22  1714
 7 7.    Zdeno Chára      1997-22  1680
 8 8.    Mark Recchi*     1988-11  1652
 9 9.    Chris Chelios*   1983-10  1651
10 10.   Dave Andreychuk* 1982-06  1639
# ℹ 240 more rows

[[2]]
# A tibble: 50 × 4
   Rank  Player          Years      GP
   <chr> <chr>           <chr>   <int>
 1 1.    André Lacroix   1972-79   551
 2 2.    Ron Plumb       1972-79   549
 3 3.    Paul Shmyr      1972-79   511
 4 4.    Michel Parizeau 1972-79   509
 5 5.    Mike Antonovich 1972-79   486
 6 6.    Rick Ley        1972-79   478
 7 7.    John McKenzie   1972-79   477
 8 8.    Blair MacDonald 1973-79   476
 9 9.    Larry Pleau     1972-79   468
10 10.   Poul Popiel     1972-78   467
# ℹ 40 more rows
  • We can then subset out the desired table based on its index within the list
nhl_tbl_list[[1]]
# A tibble: 250 × 4
   Rank  Player           Years      GP
   <chr> <chr>            <chr>   <int>
 1 1.    Patrick Marleau  1997-21  1779
 2 2.    Gordie Howe*     1946-80  1767
 3 3.    Mark Messier*    1979-04  1756
 4 4.    Jaromír Jágr     1990-18  1733
 5 5.    Ron Francis*     1981-04  1731
 6 6.    Joe Thornton     1997-22  1714
 7 7.    Zdeno Chára      1997-22  1680
 8 8.    Mark Recchi*     1988-11  1652
 9 9.    Chris Chelios*   1983-10  1651
10 10.   Dave Andreychuk* 1982-06  1639
# ℹ 240 more rows
  • For this specific example, there are only two tables, so there doesn’t seem to be any issue. But what if there are many more than two?

Data cleaning: working with stringr

The tidyverse offers stringr for string manipulation.

Check out the stringr cheatsheet.

  • The second page gives a neat overview of regular expressions (special patterns for string matching)

  • Note that some characters in an R string must be represented as special characters

    • $ * + . ? [ ] ^ { } | ( ) \

    • Use a double backslash (\\) to “escape” these characters (e.g., \\*)
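The same escaping applies outside stringr as well; a quick base-R sketch (grepl() and gsub() here are just stand-ins to show the regex behavior, not part of the slides' workflow):

```r
# "\\*" is an R string containing the two characters \*,
# which the regex engine reads as a literal asterisk
grepl("\\*", "Gordie Howe*")   # the string contains a literal *
grepl("\\*", "Gordie Howe")    # this one does not

# same idea for the literal dot
gsub("\\.", "", "10.")         # removes the dot, leaving "10"
```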

str_detect()

  • Returns TRUE/FALSE, indicating whether a string matches a specified pattern
str_detect("Gordie Howe*", "\\*")
[1] TRUE
str_detect("Gordie Howe", "\\*")
[1] FALSE
  • str_detect() can be used with filter() or within a conditional statement (e.g., with ifelse() or case_when())

    • Recall that filter() subsets out rows that satisfy a condition

str_detect()

Suppose we want to keep only the HOF players. We can detect all rows with the asterisk (*) with str_detect().

nhl_tbl |> 
  filter(str_detect(Player, "\\*"))
# A tibble: 78 × 4
   Rank  Player            Years      GP
   <chr> <chr>             <chr>   <int>
 1 2.    Gordie Howe*      1946-80  1767
 2 3.    Mark Messier*     1979-04  1756
 3 5.    Ron Francis*      1981-04  1731
 4 8.    Mark Recchi*      1988-11  1652
 5 9.    Chris Chelios*    1983-10  1651
 6 10.   Dave Andreychuk*  1982-06  1639
 7 11.   Scott Stevens*    1982-04  1635
 8 12.   Larry Murphy*     1980-01  1615
 9 13.   Ray Bourque*      1979-01  1612
10 14.   Nicklas Lidström* 1991-12  1564
# ℹ 68 more rows

Back to the example…

Recall that one of the data cleaning tasks is to create an indicator variable for whether a player is in the HOF

nhl_tbl |> 
  mutate(HOF = ifelse(str_detect(Player, "\\*"), 1, 0))
# A tibble: 250 × 5
   Rank  Player           Years      GP   HOF
   <chr> <chr>            <chr>   <int> <dbl>
 1 1.    Patrick Marleau  1997-21  1779     0
 2 2.    Gordie Howe*     1946-80  1767     1
 3 3.    Mark Messier*    1979-04  1756     1
 4 4.    Jaromír Jágr     1990-18  1733     0
 5 5.    Ron Francis*     1981-04  1731     1
 6 6.    Joe Thornton     1997-22  1714     0
 7 7.    Zdeno Chára      1997-22  1680     0
 8 8.    Mark Recchi*     1988-11  1652     1
 9 9.    Chris Chelios*   1983-10  1651     1
10 10.   Dave Andreychuk* 1982-06  1639     1
# ℹ 240 more rows

str_remove()

  • str_remove() takes in a string and removes a specified pattern.
str_remove("Gordie Howe*", "\\*")
[1] "Gordie Howe"
  • str_remove() can be used with mutate()

    • Recall that mutate() creates/modifies variables that are functions of existing variables
  • There’s a related function named str_remove_all(). The following code illustrates the difference.
str_remove("*Gordie* Howe*", "\\*")
[1] "Gordie* Howe*"
str_remove_all("*Gordie* Howe*", "\\*")
[1] "Gordie Howe"

Back to the example…

  • Now we build upon the previous code to finish the data cleaning process.

  • Recall that we want to remove the asterisk (*) and dot (.) from the Player and Rank columns, respectively.

nhl_tbl_cleaned <- nhl_tbl |> 
  mutate(HOF = ifelse(str_detect(Player, "\\*"), 1, 0),
         Player = str_remove(Player, "\\*"),
         Rank = str_remove(Rank, "\\."))

nhl_tbl_cleaned
# A tibble: 250 × 5
   Rank  Player          Years      GP   HOF
   <chr> <chr>           <chr>   <int> <dbl>
 1 1     Patrick Marleau 1997-21  1779     0
 2 2     Gordie Howe     1946-80  1767     1
 3 3     Mark Messier    1979-04  1756     1
 4 4     Jaromír Jágr    1990-18  1733     0
 5 5     Ron Francis     1981-04  1731     1
 6 6     Joe Thornton    1997-22  1714     0
 7 7     Zdeno Chára     1997-22  1680     0
 8 8     Mark Recchi     1988-11  1652     1
 9 9     Chris Chelios   1983-10  1651     1
10 10    Dave Andreychuk 1982-06  1639     1
# ℹ 240 more rows

Scraping practice

Example: Frauen Bundesliga (German women’s soccer league)

  • URL: https://fbref.com/en/comps/183/2017-2018/2017-2018-Frauen-Bundesliga-Stats

  • The link above provides stats for the 2017-2018 season

    • Scrape the Overall table under Regular season

    • (Time permitting) Write a general function for scraping data for any specified season.

      • Hint: change the years in the URL and CSS selector

      • Get data for every season between 2016-2017 and 2019-2020 and combine them into a single table

Scraping practice

  • Scrape 2017-2018 overall standings
fb_url <- "https://fbref.com/en/comps/183/2017-2018/2017-2018-Frauen-Bundesliga-Stats"
fb_tbl <- fb_url |>
  read_html() |>
  html_element(css = "#results2017-20181831_overall") |>
  html_table()
fb_tbl
# A tibble: 12 × 15
      Rk Squad             MP     W     D     L    GF    GA    GD   Pts `Pts/MP`
   <int> <chr>          <int> <int> <int> <int> <int> <int> <int> <int>    <dbl>
 1     1 Wolfsburg         22    18     2     2    56     8    48    56     2.55
 2     2 Bayern Munich     22    17     2     3    62    15    47    53     2.41
 3     3 Freiburg          22    15     3     4    50    15    35    48     2.18
 4     4 Turbine Potsd…    22    13     6     3    50    21    29    45     2.05
 5     5 Essen             22    12     3     7    43    30    13    39     1.77
 6     6 FFC Frankfurt     22    10     1    11    29    25     4    31     1.41
 7     7 Sand              22     9     3    10    32    34    -2    30     1.36
 8     8 Hoffenheim        22     8     1    13    22    32   -10    25     1.14
 9     9 MSV Duisburg      22     6     0    16    16    33   -17    18     0.82
10    10 Werder Bremen     22     3     5    14    26    59   -33    14     0.64
11    11 Köln              22     3     2    17     8    78   -70    11     0.5 
12    12 USV Jena          22     2     4    16    12    56   -44    10     0.45
# ℹ 4 more variables: Attendance <chr>, `Top Team Scorer` <chr>,
#   Goalkeeper <chr>, Notes <chr>

Scraping practice

  • General function
get_fb_data <- function(start_year) {

  year_str <- str_c(start_year, start_year + 1, sep = "-")

  fb_url <- str_c("https://fbref.com/en/comps/183/", year_str, "/", year_str, "-Frauen-Bundesliga-Stats")

  year_css <- str_c("#results", year_str, "1831_overall")

  fb_tbl <- fb_url |>
    read_html() |>
    html_element(css = year_css) |>
    html_table() |>
    mutate(season = year_str)

  return(fb_tbl)
}

seasons <- 2016:2019
fb_tbl_full <- seasons |>
  map(get_fb_data) |>
  list_rbind()
# fb_tbl_full

Continuing with the Frauen Bundesliga 2017-2018 season stats example…

  • Notice that the table on the webpage also contains team logos, URLs, etc.

  • This information was not extracted with html_table()

  • Suppose we’re also interested in getting the team URLs and logos

  • We first store the HTML node for the table in an object (i.e., everything up to html_element() with a specified CSS selector for the table)

  • This can then be used to obtain the images and team links based on their tags.

fb_url <- "https://fbref.com/en/comps/183/2017-2018/2017-2018-Frauen-Bundesliga-Stats"
fb_node <- fb_url |> 
  read_html() |> 
  html_element(css = "#results2017-20181831_overall")
fb_node
{html_node}
<table class="stats_table sortable min_width force_mobilize" id="results2017-20181831_overall" data-cols-to-freeze=",2">
[1] <caption>Regular season Table</caption>
[2] <colgroup>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col> ...
[3] <thead><tr>\n<th aria-label="Rank" data-stat="rank" scope="col" class=" p ...
[4] <tbody>\n<tr>\n<th scope="row" class="right qualifier qualification_indic ...
  • To get all image elements, we can use html_elements() and specify the img tag

    • html_elements() is the “plural” version of html_element(), since we want ALL image elements, not just one (honestly, if you don’t remember the difference, just try both and see which one is suitable)
fb_node |> 
  html_elements("img")
{xml_nodeset (12)}
 [1] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [2] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [3] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [4] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [5] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [6] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [7] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [8] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [9] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[10] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[11] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[12] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
  • Notice that each image contains different attributes such as height, width, image path (src)
  • Suppose we want to grab the image path only (for future plotting purposes); we can use html_attr() to get the src attribute
fb_imgs <- fb_node |> 
  html_elements("img") |> 
  html_attr("src")
fb_imgs
 [1] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.a1393014.png"
 [2] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.51ec22be.png"
 [3] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.b4de690d.png"
 [4] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.de550500.png"
 [5] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.becc1dd0.png"
 [6] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.77d2e598.png"
 [7] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.0cc34cf4.png"
 [8] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.87705c62.png"
 [9] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.0580d9a9.png"
[10] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.7adbf480.png"
[11] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.88ddc98e.png"
[12] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.765472c9.png"
  • To get all the URLs in the Squad column, we can first use html_elements() again and specify the "a" tag

    • The <a> (anchor) tag defines a hyperlink
fb_node |> 
  html_elements("a")
{xml_nodeset (38)}
 [1] <a href="/en/squads/a1393014/2017-2018/Wolfsburg-Women-Stats">Wolfsburg</a>
 [2] <a href="/en/players/363b99a4/Pernille-Harder">Pernille Harder</a>
 [3] <a href="/en/players/992b30a1/Almuth-Schult">Almuth Schult</a>
 [4] <a href="/en/squads/51ec22be/2017-2018/Bayern-Munich-Women-Stats">Bayern ...
 [5] <a href="/en/players/6862731d/Fridolina-Rolfo">Fridolina Rolfö</a>
 [6] <a href="/en/players/c7dc2a33/Manuela-Zinsberger">Manuela Zinsberger</a>
 [7] <a href="/en/squads/b4de690d/2017-2018/Freiburg-Women-Stats">Freiburg</a>
 [8] <a href="/en/players/3cd04ba1/Lina-Magull">Lina Magull</a>
 [9] <a href="/en/players/82c4f339/Laura-Benkarth">Laura Benkarth</a>
[10] <a href="/en/squads/de550500/2017-2018/Turbine-Potsdam-Stats">Turbine Po ...
[11] <a href="/en/players/8b5f141c/Svenja-Huth">Svenja Huth</a>
[12] <a href="/en/players/38bbb38c/Lisa-Schmitz">Lisa Schmitz</a>
[13] <a href="/en/squads/becc1dd0/2017-2018/Essen-Stats">Essen</a>
[14] <a href="/en/players/5a20e7f0/Linda-Dallmann">Linda Dallmann</a>
[15] <a href="/en/players/8699d87d/Lisa-Weiss">Lisa Weiß</a>
[16] <a href="/en/squads/77d2e598/2017-2018/FFC-Frankfurt-Stats">FFC Frankfur ...
[17] <a href="/en/players/6698c9f0/Jackie-Groenen">Jackie Groenen</a>
[18] <a href="/en/players/0dfe3e98/Bryane-Heaberlin">Bryane Heaberlin</a>
[19] <a href="/en/squads/0cc34cf4/2017-2018/Sand-Stats">Sand</a>
[20] <a href="/en/players/0f55e5ea/Nina-Burger">Nina Burger</a>
...
  • Within <a>, we can grab the href attribute with html_attr()

    • href indicates the URL/page associated with the link
fb_node |> 
  html_elements("a") |> 
  html_attr("href")
 [1] "/en/squads/a1393014/2017-2018/Wolfsburg-Women-Stats"    
 [2] "/en/players/363b99a4/Pernille-Harder"                   
 [3] "/en/players/992b30a1/Almuth-Schult"                     
 [4] "/en/squads/51ec22be/2017-2018/Bayern-Munich-Women-Stats"
 [5] "/en/players/6862731d/Fridolina-Rolfo"                   
 [6] "/en/players/c7dc2a33/Manuela-Zinsberger"                
 [7] "/en/squads/b4de690d/2017-2018/Freiburg-Women-Stats"     
 [8] "/en/players/3cd04ba1/Lina-Magull"                       
 [9] "/en/players/82c4f339/Laura-Benkarth"                    
[10] "/en/squads/de550500/2017-2018/Turbine-Potsdam-Stats"    
[11] "/en/players/8b5f141c/Svenja-Huth"                       
[12] "/en/players/38bbb38c/Lisa-Schmitz"                      
[13] "/en/squads/becc1dd0/2017-2018/Essen-Stats"              
[14] "/en/players/5a20e7f0/Linda-Dallmann"                    
[15] "/en/players/8699d87d/Lisa-Weiss"                        
[16] "/en/squads/77d2e598/2017-2018/FFC-Frankfurt-Stats"      
[17] "/en/players/6698c9f0/Jackie-Groenen"                    
[18] "/en/players/0dfe3e98/Bryane-Heaberlin"                  
[19] "/en/squads/0cc34cf4/2017-2018/Sand-Stats"               
[20] "/en/players/0f55e5ea/Nina-Burger"                       
[21] "/en/players/9474cd93/Carina-Schluter"                   
[22] "/en/squads/87705c62/2017-2018/Hoffenheim-Women-Stats"   
[23] "/en/players/9e30ae90/Isabella-Hartig"                   
[24] "/en/players/277cdd4e/Tabea-Wassmuth"                    
[25] "/en/players/7c51bea4/Friederike-Abt"                    
[26] "/en/squads/0580d9a9/2017-2018/MSV-Duisburg-Women-Stats" 
[27] "/en/players/88be98d1/Kathleen-Radtke"                   
[28] "/en/players/781a82de/Lena-Nuding"                       
[29] "/en/squads/7adbf480/2017-2018/Werder-Bremen-Women-Stats"
[30] "/en/players/e070cdf6/Nina-Luhrssen"                     
[31] "/en/players/e18dc3a3/Nora-Clausen"                      
[32] "/en/players/37469b31/Anneke-Borbe"                      
[33] "/en/squads/88ddc98e/2017-2018/Koln-Women-Stats"         
[34] "/en/players/998749f3/Amber-Hearn"                       
[35] "/en/players/9abeb65b/Anne-Kathrine-Kremer"              
[36] "/en/squads/765472c9/2017-2018/USV-Jena-Stats"           
[37] "/en/players/b004aab5/Amelia-Pietrangelo"                
[38] "/en/players/90c69bf5/Justien-Odeurs"                    
  • Notice that the previous output is a vector with all hyperlinks in the table, including all squads and players

  • Since we want squads only, we need to subset out all strings with the keyword "squads"

  • The function str_subset() comes in handy here

    • str_subset() returns only the vector elements that match a pattern
fb_links <- fb_node |> 
  html_elements("a") |> 
  html_attr("href") |> 
  str_subset("squads")
fb_links
 [1] "/en/squads/a1393014/2017-2018/Wolfsburg-Women-Stats"    
 [2] "/en/squads/51ec22be/2017-2018/Bayern-Munich-Women-Stats"
 [3] "/en/squads/b4de690d/2017-2018/Freiburg-Women-Stats"     
 [4] "/en/squads/de550500/2017-2018/Turbine-Potsdam-Stats"    
 [5] "/en/squads/becc1dd0/2017-2018/Essen-Stats"              
 [6] "/en/squads/77d2e598/2017-2018/FFC-Frankfurt-Stats"      
 [7] "/en/squads/0cc34cf4/2017-2018/Sand-Stats"               
 [8] "/en/squads/87705c62/2017-2018/Hoffenheim-Women-Stats"   
 [9] "/en/squads/0580d9a9/2017-2018/MSV-Duisburg-Women-Stats" 
[10] "/en/squads/7adbf480/2017-2018/Werder-Bremen-Women-Stats"
[11] "/en/squads/88ddc98e/2017-2018/Koln-Women-Stats"         
[12] "/en/squads/765472c9/2017-2018/USV-Jena-Stats"           
  • Finally, we can add the team images and links as two new columns in our table
fb_tbl <- fb_node |> 
  html_table() |> 
  mutate(img = fb_imgs,
         link = fb_links)

Let’s make a scatterplot of goals scored (GF, goals for) against goals conceded (GA, goals against), and display the team logos.

library(ggimage)
fb_tbl |>
  mutate(img = str_remove(img, "mini.")) |> 
  ggplot(aes(GA, GF)) +
  geom_image(aes(image = img), size = 0.08, asp = 1) +
  theme_classic()

Scraping text

Scraping text

  • Just as before, after reading in the page, we inspect and grab the CSS selector for only this blurb of text
wimbledon_url <- "https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles" 
wimbledon_url |> 
  read_html() |> 
  html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)")
{html_node}
<div class="div-col">
[1] <dl>\n<dd>\n<span style="visibility:hidden;color:transparent;">0</span><a ...
[2] <dl>\n<dd>\n<a href="#Section_1">17</a>.   <span class="flagicon"><span c ...

Scraping text

  • Then, we can retrieve text from this element with html_text2()
wimbledon_url |> 
  read_html() |> 
  html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)") |> 
  html_text2()
[1] "01. Dinara Safina (semifinals)\n02. Serena Williams (champion)\n03. Venus Williams (final)\n04. Elena Dementieva (semifinals)\n05. Svetlana Kuznetsova (third round)\n06. Jelena Janković (third round)\n07. Vera Zvonareva (third round, withdrew due to an ankle injury)\n08. Victoria Azarenka (quarterfinals)\n09. Caroline Wozniacki (fourth round)\n10. Nadia Petrova (fourth round)\n11. Agnieszka Radwańska (quarterfinals)\n12. Marion Bartoli (third round)\n13. Ana Ivanovic (fourth round, retired due to a thigh injury)\n14. Dominika Cibulková (third round)\n15. Flavia Pennetta (third round)\n16. Zheng Jie (second round)\n17. Amélie Mauresmo (fourth round)\n18. Samantha Stosur (third round)\n19. Li Na (third round)\n20. Anabel Medina Garrigues (third round)\n21. Patty Schnyder (first round)\n22. Alizé Cornet (first round)\n23. Aleksandra Wozniak (first round)\n24. Maria Sharapova (second round)\n25. Kaia Kanepi (first round)\n26. Virginie Razzano (fourth round)\n27. Alisa Kleybanova (second round)\n28. Sorana Cîrstea (third round)\n29. Sybille Bammer (first round)\n30. Ágnes Szávay (first round)\n31. Anastasia Pavlyuchenkova (second round)\n32. Anna Chakvetadze (first round)"

Scraping text

  • Notice this outputs a single string of all the text

  • Each combination of seed-player-result is separated by a newline character \n

  • There are many ways to separate these; one is with str_split_1()

    • str_split_1() splits a single string into a character vector based on a pattern

    • Other ways include: str_split() then unlist(), or read_lines(), and many more
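For comparison, a base-R sketch of the same split via strsplit() then unlist() (the toy string below is hypothetical, standing in for the scraped text):

```r
txt <- "01. Dinara Safina (semifinals)\n02. Serena Williams (champion)"

# strsplit() returns a list (one element per input string);
# unlist() flattens it into a character vector, like str_split_1()
unlist(strsplit(txt, "\n"))
```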

Scraping text

wimbledon_info <- wimbledon_url |> 
  read_html() |> 
  html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)") |> 
  html_text2() |>
  str_split_1("\\n")
wimbledon_info
 [1] "01. Dinara Safina (semifinals)"                                   
 [2] "02. Serena Williams (champion)"                                   
 [3] "03. Venus Williams (final)"                                       
 [4] "04. Elena Dementieva (semifinals)"                                
 [5] "05. Svetlana Kuznetsova (third round)"                            
 [6] "06. Jelena Janković (third round)"                                
 [7] "07. Vera Zvonareva (third round, withdrew due to an ankle injury)"
 [8] "08. Victoria Azarenka (quarterfinals)"                            
 [9] "09. Caroline Wozniacki (fourth round)"                            
[10] "10. Nadia Petrova (fourth round)"                                 
[11] "11. Agnieszka Radwańska (quarterfinals)"                          
[12] "12. Marion Bartoli (third round)"                                 
[13] "13. Ana Ivanovic (fourth round, retired due to a thigh injury)"   
[14] "14. Dominika Cibulková (third round)"                             
[15] "15. Flavia Pennetta (third round)"                                
[16] "16. Zheng Jie (second round)"                                     
[17] "17. Amélie Mauresmo (fourth round)"                               
[18] "18. Samantha Stosur (third round)"                                
[19] "19. Li Na (third round)"                                          
[20] "20. Anabel Medina Garrigues (third round)"                        
[21] "21. Patty Schnyder (first round)"                                 
[22] "22. Alizé Cornet (first round)"                                   
[23] "23. Aleksandra Wozniak (first round)"                             
[24] "24. Maria Sharapova (second round)"                               
[25] "25. Kaia Kanepi (first round)"                                    
[26] "26. Virginie Razzano (fourth round)"                              
[27] "27. Alisa Kleybanova (second round)"                              
[28] "28. Sorana Cîrstea (third round)"                                 
[29] "29. Sybille Bammer (first round)"                                 
[30] "30. Ágnes Szávay (first round)"                                   
[31] "31. Anastasia Pavlyuchenkova (second round)"                      
[32] "32. Anna Chakvetadze (first round)"                               

Scraping text

  • As a final step, you can try to turn the vector into a table (as a single column), then clean it up to create three columns: seed, player, and result

(This might involve tasks like extracting text between parentheses, locating special characters like . ( ), etc. — check out this blog post)
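One possible sketch of that cleanup in base R (the regex patterns below are assumptions based on the "seed. player (result)" format; a stringr version with str_match() would look similar):

```r
x <- "02. Serena Williams (champion)"  # toy string in the slides' format

# capture the digits before the dot
seed   <- sub("^(\\d+)\\..*$", "\\1", x)
# capture the text between the dot and the opening parenthesis
player <- sub("^\\d+\\.\\s*(.*)\\s+\\(.*\\)$", "\\1", x)
# capture the text between the parentheses
result <- sub("^.*\\((.*)\\)$", "\\1", x)

c(seed, player, result)
```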

(Web) API basics

An API (Application Programming Interface) connects computer programs to each other

Web APIs provide interactions between a client device and a web server using the Hypertext Transfer Protocol (HTTP)

  • Clients send an (HTTP) request and receive a response (typically in JSON or XML)

  • Many organizations have their own public API, which can be used to access data

  • Fortunately, there exist many R packages (sports and non-sports) that provide access to APIs for obtaining data

The httr package

httr offers a general way of getting data from an API, via different tools for working with HTTP

  • GET() sends a request to an API and captures the response

  • content() extracts the data from the response

These two functions are illustrated in the next example

There are many other useful functions in httr

  • For example, PUT() and POST() can be used to send data to APIs

  • Other popular verbs are PATCH(), HEAD(), and DELETE()

Pulling data from APIs

Example: Formula One API (Inspiration: Tidy Tuesday 2021-09-07)

Pulling data from APIs

  • First, we can use GET() to send a request to the API. We then receive the data via a response
library(httr)
f1_api <- "http://ergast.com/api/f1/constructorStandings/1/constructors.json"
f1_response <- f1_api |> 
  GET()
f1_response
Response [http://ergast.com/api/f1/constructorStandings/1/constructors.json]
  Date: 2023-07-20 14:18
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 2.4 kB
# check the type of the response object and whether we get an error
# http_type(f1_response)
# http_error(f1_response)

Pulling data from APIs

  • Next, we want to get the data from the response by calling content(). We can then view the structure of the content object.
f1_content <- f1_response |>   
  content()
glimpse(f1_content)
List of 1
 $ MRData:List of 7
  ..$ xmlns           : chr "http://ergast.com/mrd/1.5"
  ..$ series          : chr "f1"
  ..$ url             : chr "http://ergast.com/api/f1/constructorstandings/1/constructors.json"
  ..$ limit           : chr "30"
  ..$ offset          : chr "0"
  ..$ total           : chr "17"
  ..$ ConstructorTable:List of 2
  .. ..$ constructorStandings: chr "1"
  .. ..$ Constructors        :List of 17

Pulling data from APIs

  • Finally, based on the content structure, we can get a list of constructors. Each list consists of constructor ID, URL, name, and nationality.
f1_constructor_list <- f1_content |> 
  pluck("MRData") |> 
  pluck("ConstructorTable") |> 
  pluck("Constructors")
# f1_constructor_list
f1_constructor_list[[1]]
$constructorId
[1] "benetton"

$url
[1] "http://en.wikipedia.org/wiki/Benetton_Formula"

$name
[1] "Benetton"

$nationality
[1] "Italian"
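A small simplification: pluck() accepts multiple accessors, so the three calls above collapse into one.

```r
library(purrr)

# Equivalent to chaining pluck() three times
f1_constructor_list <- f1_content |> 
  pluck("MRData", "ConstructorTable", "Constructors")
```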

Pulling data from APIs

  • A few extra transformation steps will give us the desired table
f1_constructor_tbl <- f1_constructor_list |> 
  as_tibble_col(column_name = "info") |> # convert list to tibble
  unnest_wider(info) # unnest a list-column into columns
f1_constructor_tbl
# A tibble: 17 × 4
   constructorId url                                           name  nationality
   <chr>         <chr>                                         <chr> <chr>      
 1 benetton      http://en.wikipedia.org/wiki/Benetton_Formula Bene… Italian    
 2 brabham-repco http://en.wikipedia.org/wiki/Brabham          Brab… British    
 3 brawn         http://en.wikipedia.org/wiki/Brawn_GP         Brawn British    
 4 brm           http://en.wikipedia.org/wiki/BRM              BRM   British    
 5 cooper-climax http://en.wikipedia.org/wiki/Cooper_Car_Comp… Coop… British    
 6 ferrari       http://en.wikipedia.org/wiki/Scuderia_Ferrari Ferr… Italian    
 7 lotus-climax  http://en.wikipedia.org/wiki/Team_Lotus       Lotu… British    
 8 lotus-ford    http://en.wikipedia.org/wiki/Team_Lotus       Lotu… British    
 9 matra-ford    http://en.wikipedia.org/wiki/Matra            Matr… French     
10 mclaren       http://en.wikipedia.org/wiki/McLaren          McLa… British    
11 mercedes      http://en.wikipedia.org/wiki/Mercedes-Benz_i… Merc… German     
12 red_bull      http://en.wikipedia.org/wiki/Red_Bull_Racing  Red … Austrian   
13 renault       http://en.wikipedia.org/wiki/Renault_in_Form… Rena… French     
14 team_lotus    http://en.wikipedia.org/wiki/Team_Lotus       Team… British    
15 tyrrell       http://en.wikipedia.org/wiki/Tyrrell_Racing   Tyrr… British    
16 vanwall       http://en.wikipedia.org/wiki/Vanwall          Vanw… British    
17 williams      http://en.wikipedia.org/wiki/Williams_Grand_… Will… British    
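An equivalent purrr-based sketch of the same transformation (assumes purrr >= 1.0.0 for list_rbind()):

```r
library(purrr)
library(tibble)

# Each constructor is a named list of length-one elements, so it converts
# cleanly to a one-row tibble; list_rbind() then stacks the rows
f1_constructor_tbl <- f1_constructor_list |> 
  map(as_tibble) |> 
  list_rbind()
```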

Scrape responsibly!

The polite package: overview

install.packages("polite")

polite ensures that you respect the website's robots.txt and don't submit too many requests

  • bow() introduces the user to the host and asks for scraping permission

  • scrape() scrapes and retrieves data

(Sometimes, nod() is required as an intermediate step, to agree with the host on a modification of the session path)
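A sketch of that intermediate step: bow to the host once, then nod() to agree on a specific path before scraping.

```r
library(polite)

# Bow to the host once, then nod to a specific path on that host
session <- bow("https://en.wikipedia.org/")
session <- nod(session, path = "wiki/2009_Wimbledon_Championships_-_Women's_singles")
# result <- scrape(session)
```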

The polite package: example

  • Example: Wimbledon Women’s singles (same as before)

  • First, pass the URL into bow() to get a “session” object

    • This gives information about the robots.txt and whether the webpage is scrapable
library(polite)
wimbledon_url <- "https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles"
session <- wimbledon_url |> 
  bow()
session
<polite session> https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles
    User-agent: polite R package
    robots.txt: 456 rules are defined for 33 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

The polite package: example

  • Now, use scrape() to get the data from the session previously created by bow()

    • This essentially replaces read_html() as seen earlier
session |> 
  scrape()
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...

The polite package: example

  • The remaining steps are similar to before. We can reuse the earlier code for selecting the HTML element and retrieving its text.
session |> 
  scrape() |> 
  html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)") |> 
  html_text2() |>
  str_split_1("\\n")
 [1] "01. Dinara Safina (semifinals)"                                   
 [2] "02. Serena Williams (champion)"                                   
 [3] "03. Venus Williams (final)"                                       
 [4] "04. Elena Dementieva (semifinals)"                                
 [5] "05. Svetlana Kuznetsova (third round)"                            
 [6] "06. Jelena Janković (third round)"                                
 [7] "07. Vera Zvonareva (third round, withdrew due to an ankle injury)"
 [8] "08. Victoria Azarenka (quarterfinals)"                            
 [9] "09. Caroline Wozniacki (fourth round)"                            
[10] "10. Nadia Petrova (fourth round)"                                 
[11] "11. Agnieszka Radwańska (quarterfinals)"                          
[12] "12. Marion Bartoli (third round)"                                 
[13] "13. Ana Ivanovic (fourth round, retired due to a thigh injury)"   
[14] "14. Dominika Cibulková (third round)"                             
[15] "15. Flavia Pennetta (third round)"                                
[16] "16. Zheng Jie (second round)"                                     
[17] "17. Amélie Mauresmo (fourth round)"                               
[18] "18. Samantha Stosur (third round)"                                
[19] "19. Li Na (third round)"                                          
[20] "20. Anabel Medina Garrigues (third round)"                        
[21] "21. Patty Schnyder (first round)"                                 
[22] "22. Alizé Cornet (first round)"                                   
[23] "23. Aleksandra Wozniak (first round)"                             
[24] "24. Maria Sharapova (second round)"                               
[25] "25. Kaia Kanepi (first round)"                                    
[26] "26. Virginie Razzano (fourth round)"                              
[27] "27. Alisa Kleybanova (second round)"                              
[28] "28. Sorana Cîrstea (third round)"                                 
[29] "29. Sybille Bammer (first round)"                                 
[30] "30. Ágnes Szávay (first round)"                                   
[31] "31. Anastasia Pavlyuchenkova (second round)"                      
[32] "32. Anna Chakvetadze (first round)"                               
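As a possible follow-up, the "seed. name (result)" strings above could be split into columns with tidyr. This is a sketch: `seeds` is assumed to hold the character vector produced by the pipeline above.

```r
library(tibble)
library(tidyr)

# `seeds` is assumed to be the character vector of seed strings from above
seeds_tbl <- tibble(raw = seeds) |> 
  extract(
    raw,
    into  = c("seed", "player", "result"),
    regex = "(\\d+)\\.\\s+(.*?)\\s+\\((.*)\\)"
  )
```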

More resources

Final words

  • Web scraping is an excellent means for gaining proficiency in data cleaning

    • It takes time; the more you practice, the better you get

    • Inspect the output at each step

    • Consult the help documentation

  • Come up with fun personal projects (data viz, Shiny app, etc.), scrape data, and enjoy learning (and the struggle)

  • You can develop the next great sports “scrapR” package (or one for your own field of interest)