Web Scraping: A Primer

Quang Nguyen

Department of Statistics & Data Science
Carnegie Mellon University

SURE 2023 - CMSACamp

@qntkhvn       qntkhvn       qntkhvn.netlify.app

Slides: qntkhvn.github.io/webscraping

First, a moment of appreciation

  • Sports analytics today would look very different without web scraping

    • Publicly available data

    • Reproducible research

  • A breakthrough…

Overview

Goal: after this lecture, you should be able to use R to scrape an HTML table from the web (and hopefully more).

Agenda:

  • Webpage basics

  • Web scraping with rvest

    • Featuring stringr
  • APIs

  • Responsible web scraping

    • Best practices

    • Featuring polite

Material credits

I highly recommend the Web scraping chapter in R4DS (2e) for a neat basic overview of webpage structure and web scraping.

Webpage basics

  • HTML (HyperText Markup Language) defines the content and structure of a webpage

    • An HTML page contains various elements (headers, paragraphs, tables…)

    • HTML tags define where an element begins and ends

    • Opening and closing tags have the forms <tagname> and </tagname> (e.g., <table> and </table>)

  • CSS (Cascading Style Sheets) defines the appearance of HTML elements (i.e., whether the webpage is pretty or ugly)

    • CSS selectors are patterns used to select the elements to be styled

    • CSS selectors can also be used to extract elements from a webpage
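As a concrete illustration, a minimal (hypothetical, not from any real page) HTML table might look like the fragment below; the id attribute is what a CSS selector such as #stats can target:

```html
<table id="stats" class="stats_table">
  <thead>
    <tr><th>Rank</th><th>Player</th></tr>
  </thead>
  <tbody>
    <tr><td>1.</td><td>Patrick Marleau</td></tr>
  </tbody>
</table>
```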

Web scraping in R

Two widely-used packages:

  • rvest: simple, tidyverse friendly, static data

  • RSelenium: more advanced, dynamic data

We will focus on rvest today

install.packages("rvest")

(For the Python fans, you can do the same with Beautiful Soup and Selenium)

Typical web scraping workflow

  • (Survey the webpage)
  • Scrape the webpage content
  • Data organization and cleaning (most of the process)

    • Extracting elements (e.g., tables, links, etc.)

    • Common data manipulation tasks (e.g., with dplyr)

    • Handle strings (e.g., stringr, stringi) and regular expressions (regex)

  • Generalization: write functions (and develop packages)

    • For instance, in sports, you may be interested in data for not just one season/player/team/etc., but multiple

And make sure to…

  • Inspect the output at each step

  • Consult the help documentation
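In rvest terms, the workflow above typically follows the skeleton below (the URL and CSS selector here are placeholders, not a real page; the cleaning step is sketched as a comment):

```r
library(rvest)
library(dplyr)

# placeholder URL and CSS selector -- fill in for a real webpage
raw_tbl <- "https://example.com/stats" |>
  read_html() |>                        # scrape the webpage content
  html_element(css = "#some_table") |>  # extract the table element
  html_table()                          # convert to a tibble

# data cleaning step, e.g., with dplyr/stringr:
# clean_tbl <- raw_tbl |> mutate(...)
```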

Scraping HTML tables

  • Task: scrape the NHL Leaders table

    • Read the HTML page into R

    • Grab the CSS selector for the table in the browser

    • Write scraping code in R

    • Perform the following data cleaning steps

      • Create an indicator variable for whether a player is in the Hall of Fame

      • Remove the asterisk (*) from the Player column

      • Remove the dot (.) from the Rank column

Read the HTML page into R

  • We use read_html() to read in the HTML page based on a specified webpage’s URL
library(rvest)
library(tidyverse)
nhl_url <- "https://www.hockey-reference.com/leaders/games_played_career.html"
nhl_url |> 
  read_html()
{html_document}
<html data-version="klecko-" data-root="/home/hr/build" lang="en" class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="hr">\n<div id="wrap">\n  \n  <div id="header" role="banner"> ...
  • The next step is to extract the NHL Leaders table from the HTML. This can be accomplished by finding the CSS selector of the table.

How to find a table’s CSS selector

  1. Move the cursor close to the table you want to scrape (e.g., near the top of the table). Right click and select Inspect (Chrome or Firefox) or Inspect element (Safari)
  2. The Developer Tools will open in your browser. Pay attention to the Elements pane (Chrome or Safari) or Inspector pane (Firefox)

    • The HTML element corresponding to the webpage area (close to the table) mentioned in Step 1 is highlighted

    • Hovering over different HTML elements will highlight different parts of the webpage

  3. To find the CSS selector for the table, hover over different lines and stop at the line where only the entire table is highlighted

    • This will often be a line with a <table> opening tag
  4. Right click on the line, then choose Copy → Copy selector (Chrome) or Copy → CSS Selector (Firefox) or Copy → Selector Path (Safari)

Scraping HTML tables

The following video shows how to find the CSS selector (in Chrome)

Scraping HTML tables

  • Use html_element() to get the element associated with the CSS selector (table in this case)

  • Inside html_element(), specify the CSS selector that we copied earlier

  • This returns an HTML “node”

nhl_url |> 
  read_html() |> 
  html_element(css = "#stats_career_NHL")
{html_node}
<table class="suppress_glossary suppress_csv sortable stats_table" id="stats_career_NHL" data-cols-to-freeze="1,2">
[1] <caption>NHL Leaders Table</caption>
[2] <thead><tr>\n<th class="right">Rank</th>\n<th class="left">Player</th>\n< ...
[3] <tbody>\n<tr>\n<td class="right">1.</td>\n<td class="left"><a href="/play ...

Scraping HTML tables

  • Finally, use html_table() to convert to a tibble (data frame) in R

  • This completes our scraping process

nhl_tbl <- nhl_url |> 
  read_html() |> 
  html_element(css = "#stats_career_NHL") |> 
  html_table()

nhl_tbl
# A tibble: 250 × 4
   Rank  Player           Years      GP
   <chr> <chr>            <chr>   <int>
 1 1.    Patrick Marleau  1997-21  1779
 2 2.    Gordie Howe*     1946-80  1767
 3 3.    Mark Messier*    1979-04  1756
 4 4.    Jaromír Jágr     1990-18  1733
 5 5.    Ron Francis*     1981-04  1731
 6 6.    Joe Thornton     1997-22  1714
 7 7.    Zdeno Chára      1997-22  1680
 8 8.    Mark Recchi*     1988-11  1652
 9 9.    Chris Chelios*   1983-10  1651
10 10.   Dave Andreychuk* 1982-06  1639
# ℹ 240 more rows

Remarks: webpage elements

  • Note that html_table() only works when the element specified in html_element() is a table
  • There are other things that can be extracted from an element

    • To retrieve text from an element, use html_text2()

    • To retrieve an attribute (e.g., a hyperlink) from an element, use html_attr() and html_attrs()

    • We will touch on these two cases later on

  • The inspection step (for obtaining the CSS selector) can be skipped

  • html_table() can be called right after read_html()

  • This outputs a list of all the tables that exist on the webpage

nhl_tbl_list <- nhl_url |> 
  read_html() |> 
  html_table()
nhl_tbl_list
[[1]]
# A tibble: 250 × 4
   Rank  Player           Years      GP
   <chr> <chr>            <chr>   <int>
 1 1.    Patrick Marleau  1997-21  1779
 2 2.    Gordie Howe*     1946-80  1767
 3 3.    Mark Messier*    1979-04  1756
 4 4.    Jaromír Jágr     1990-18  1733
 5 5.    Ron Francis*     1981-04  1731
 6 6.    Joe Thornton     1997-22  1714
 7 7.    Zdeno Chára      1997-22  1680
 8 8.    Mark Recchi*     1988-11  1652
 9 9.    Chris Chelios*   1983-10  1651
10 10.   Dave Andreychuk* 1982-06  1639
# ℹ 240 more rows

[[2]]
# A tibble: 50 × 4
   Rank  Player          Years      GP
   <chr> <chr>           <chr>   <int>
 1 1.    André Lacroix   1972-79   551
 2 2.    Ron Plumb       1972-79   549
 3 3.    Paul Shmyr      1972-79   511
 4 4.    Michel Parizeau 1972-79   509
 5 5.    Mike Antonovich 1972-79   486
 6 6.    Rick Ley        1972-79   478
 7 7.    John McKenzie   1972-79   477
 8 8.    Blair MacDonald 1973-79   476
 9 9.    Larry Pleau     1972-79   468
10 10.   Poul Popiel     1972-78   467
# ℹ 40 more rows
  • We can then subset out the desired table based on its index within the list
nhl_tbl_list[[1]]
# A tibble: 250 × 4
   Rank  Player           Years      GP
   <chr> <chr>            <chr>   <int>
 1 1.    Patrick Marleau  1997-21  1779
 2 2.    Gordie Howe*     1946-80  1767
 3 3.    Mark Messier*    1979-04  1756
 4 4.    Jaromír Jágr     1990-18  1733
 5 5.    Ron Francis*     1981-04  1731
 6 6.    Joe Thornton     1997-22  1714
 7 7.    Zdeno Chára      1997-22  1680
 8 8.    Mark Recchi*     1988-11  1652
 9 9.    Chris Chelios*   1983-10  1651
10 10.   Dave Andreychuk* 1982-06  1639
# ℹ 240 more rows
  • For this specific example, there are only two tables, so there doesn’t seem to be any issue. But what if there are many more than two?

Data cleaning: working with stringr

The tidyverse offers stringr for string manipulation.

Check out the stringr cheatsheet.

  • The second page gives a neat overview of regular expressions (special patterns for string matching)

  • Note that some characters in an R string must be represented as special characters

    • $ * + . ? [ ] ^ { } | ( ) \

    • Use a double backslash (\\) to “escape” these characters (e.g., \\*)
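The same escaping applies outside stringr as well; a quick base-R sketch (grepl() and gsub() here are just stand-ins to show the regex behavior, not part of the slides' workflow):

```r
# "\\*" is an R string containing the two characters \*,
# which the regex engine reads as a literal asterisk
grepl("\\*", "Gordie Howe*")   # the string contains a literal *
grepl("\\*", "Gordie Howe")    # this one does not

# same idea for the literal dot
gsub("\\.", "", "10.")         # removes the dot, leaving "10"
```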

str_detect()

  • Returns TRUE/FALSE, indicating whether a string matches a specified pattern
str_detect("Gordie Howe*", "\\*")
[1] TRUE
str_detect("Gordie Howe", "\\*")
[1] FALSE
  • str_detect() can be used with filter() or within a conditional statement (e.g., with ifelse() or case_when())

    • Recall that filter() subsets out rows that satisfy a condition

str_detect()

Suppose we want to keep only the HOF players. We can detect all rows with the asterisk (*) with str_detect().

nhl_tbl |> 
  filter(str_detect(Player, "\\*"))
# A tibble: 78 × 4
   Rank  Player            Years      GP
   <chr> <chr>             <chr>   <int>
 1 2.    Gordie Howe*      1946-80  1767
 2 3.    Mark Messier*     1979-04  1756
 3 5.    Ron Francis*      1981-04  1731
 4 8.    Mark Recchi*      1988-11  1652
 5 9.    Chris Chelios*    1983-10  1651
 6 10.   Dave Andreychuk*  1982-06  1639
 7 11.   Scott Stevens*    1982-04  1635
 8 12.   Larry Murphy*     1980-01  1615
 9 13.   Ray Bourque*      1979-01  1612
10 14.   Nicklas Lidström* 1991-12  1564
# ℹ 68 more rows

Back to the example…

Recall that one of the data cleaning tasks is to create an indicator variable for whether a player is in the HOF

nhl_tbl |> 
  mutate(HOF = ifelse(str_detect(Player, "\\*"), 1, 0))
# A tibble: 250 × 5
   Rank  Player           Years      GP   HOF
   <chr> <chr>            <chr>   <int> <dbl>
 1 1.    Patrick Marleau  1997-21  1779     0
 2 2.    Gordie Howe*     1946-80  1767     1
 3 3.    Mark Messier*    1979-04  1756     1
 4 4.    Jaromír Jágr     1990-18  1733     0
 5 5.    Ron Francis*     1981-04  1731     1
 6 6.    Joe Thornton     1997-22  1714     0
 7 7.    Zdeno Chára      1997-22  1680     0
 8 8.    Mark Recchi*     1988-11  1652     1
 9 9.    Chris Chelios*   1983-10  1651     1
10 10.   Dave Andreychuk* 1982-06  1639     1
# ℹ 240 more rows

str_remove()

  • str_remove() takes in a string and removes a specified pattern.
str_remove("Gordie Howe*", "\\*")
[1] "Gordie Howe"
  • str_remove() can be used with mutate()

    • Recall that mutate() creates/modifies variables that are functions of existing variables
  • There’s a related function named str_remove_all(). The following code illustrates the difference.
str_remove("*Gordie* Howe*", "\\*")
[1] "Gordie* Howe*"
str_remove_all("*Gordie* Howe*", "\\*")
[1] "Gordie Howe"

Back to the example…

  • Now we build upon the previous code to finish the data cleaning process.

  • Recall that we want to remove the asterisk (*) and dot (.) from the Player and Rank columns, respectively.

nhl_tbl_cleaned <- nhl_tbl |> 
  mutate(HOF = ifelse(str_detect(Player, "\\*"), 1, 0),
         Player = str_remove(Player, "\\*"),
         Rank = str_remove(Rank, "\\."))

nhl_tbl_cleaned
# A tibble: 250 × 5
   Rank  Player          Years      GP   HOF
   <chr> <chr>           <chr>   <int> <dbl>
 1 1     Patrick Marleau 1997-21  1779     0
 2 2     Gordie Howe     1946-80  1767     1
 3 3     Mark Messier    1979-04  1756     1
 4 4     Jaromír Jágr    1990-18  1733     0
 5 5     Ron Francis     1981-04  1731     1
 6 6     Joe Thornton    1997-22  1714     0
 7 7     Zdeno Chára     1997-22  1680     0
 8 8     Mark Recchi     1988-11  1652     1
 9 9     Chris Chelios   1983-10  1651     1
10 10    Dave Andreychuk 1982-06  1639     1
# ℹ 240 more rows

Scraping practice

Example: Frauen Bundesliga (German women’s soccer league)

  • URL: https://fbref.com/en/comps/183/2017-2018/2017-2018-Frauen-Bundesliga-Stats

  • The link above provides stats for the 2017-2018 season

    • Scrape the Overall table under Regular season

    • (Time permitting) Write a general function for scraping data for any specified season.

      • Hint: change the years in the URL and CSS selector

      • Get data for every season between 2016-2017 and 2019-2020 and combine them into a single table

Scraping practice

  • Scrape 2017-2018 overall standings
fb_url <- "https://fbref.com/en/comps/183/2017-2018/2017-2018-Frauen-Bundesliga-Stats"
fb_tbl <- fb_url |>
  read_html() |>
  html_element(css = "#results2017-20181831_overall") |>
  html_table()
fb_tbl
# A tibble: 12 × 15
      Rk Squad             MP     W     D     L    GF    GA    GD   Pts `Pts/MP`
   <int> <chr>          <int> <int> <int> <int> <int> <int> <int> <int>    <dbl>
 1     1 Wolfsburg         22    18     2     2    56     8    48    56     2.55
 2     2 Bayern Munich     22    17     2     3    62    15    47    53     2.41
 3     3 Freiburg          22    15     3     4    50    15    35    48     2.18
 4     4 Turbine Potsd…    22    13     6     3    50    21    29    45     2.05
 5     5 Essen             22    12     3     7    43    30    13    39     1.77
 6     6 FFC Frankfurt     22    10     1    11    29    25     4    31     1.41
 7     7 Sand              22     9     3    10    32    34    -2    30     1.36
 8     8 Hoffenheim        22     8     1    13    22    32   -10    25     1.14
 9     9 MSV Duisburg      22     6     0    16    16    33   -17    18     0.82
10    10 Werder Bremen     22     3     5    14    26    59   -33    14     0.64
11    11 Köln              22     3     2    17     8    78   -70    11     0.5 
12    12 USV Jena          22     2     4    16    12    56   -44    10     0.45
# ℹ 4 more variables: Attendance <chr>, `Top Team Scorer` <chr>,
#   Goalkeeper <chr>, Notes <chr>

Scraping practice

  • General function
get_fb_data <- function(start_year) {

  year_str <- str_c(start_year, start_year + 1, sep = "-")

  fb_url <- str_c("https://fbref.com/en/comps/183/", year_str, "/", year_str, "-Frauen-Bundesliga-Stats")

  year_css <- str_c("#results", year_str, "1831_overall")

  fb_tbl <- fb_url |>
    read_html() |>
    html_element(css = year_css) |>
    html_table() |>
    mutate(season = year_str)

  return(fb_tbl)
}

seasons <- 2016:2019
fb_tbl_full <- seasons |>
  map(get_fb_data) |>
  list_rbind()
# fb_tbl_full

Continuing with the Frauen Bundesliga 2017-2018 season stats example…

  • Notice that the table on the webpage also contains team logos, URLs, etc.

  • This information was not extracted with html_table()

  • Suppose we’re also interested in getting the team URLs and logos

  • We first store the HTML node for the table in an object (i.e., everything up to html_element() with a specified CSS selector for the table)

  • This can then be used to obtain the images and team links based on their tags.

fb_url <- "https://fbref.com/en/comps/183/2017-2018/2017-2018-Frauen-Bundesliga-Stats"
fb_node <- fb_url |> 
  read_html() |> 
  html_element(css = "#results2017-20181831_overall")
fb_node
{html_node}
<table class="stats_table sortable min_width force_mobilize" id="results2017-20181831_overall" data-cols-to-freeze=",2">
[1] <caption>Regular season Table</caption>
[2] <colgroup>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col> ...
[3] <thead><tr>\n<th aria-label="Rank" data-stat="rank" scope="col" class=" p ...
[4] <tbody>\n<tr>\n<th scope="row" class="right qualifier qualification_indic ...
  • To get all image elements, we can use html_elements() and specify the img tag

    • html_elements() is the “plural” version of html_element(), since we want ALL image elements, not just one (honestly, if you don’t remember the difference, just try both and see which one is suitable)
fb_node |> 
  html_elements("img")
{xml_nodeset (12)}
 [1] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [2] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [3] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [4] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [5] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [6] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [7] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [8] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
 [9] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[10] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[11] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[12] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
  • Notice that each image contains different attributes such as height, width, image path (src)
  • Suppose we want to grab the image path only (for future plotting purposes); we can use html_attr() to get the src attribute
fb_imgs <- fb_node |> 
  html_elements("img") |> 
  html_attr("src")
fb_imgs
 [1] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.a1393014.png"
 [2] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.51ec22be.png"
 [3] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.b4de690d.png"
 [4] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.de550500.png"
 [5] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.becc1dd0.png"
 [6] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.77d2e598.png"
 [7] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.0cc34cf4.png"
 [8] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.87705c62.png"
 [9] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.0580d9a9.png"
[10] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.7adbf480.png"
[11] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.88ddc98e.png"
[12] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.765472c9.png"
  • To get all the URLs in the Squad column, we can first use html_elements() again and specify the "a" tag

    • The <a> (anchor) tag defines a hyperlink
fb_node |> 
  html_elements("a")
{xml_nodeset (38)}
 [1] <a href="/en/squads/a1393014/2017-2018/Wolfsburg-Women-Stats">Wolfsburg</a>
 [2] <a href="/en/players/363b99a4/Pernille-Harder">Pernille Harder</a>
 [3] <a href="/en/players/992b30a1/Almuth-Schult">Almuth Schult</a>
 [4] <a href="/en/squads/51ec22be/2017-2018/Bayern-Munich-Women-Stats">Bayern ...
 [5] <a href="/en/players/6862731d/Fridolina-Rolfo">Fridolina Rolfö</a>
 [6] <a href="/en/players/c7dc2a33/Manuela-Zinsberger">Manuela Zinsberger</a>
 [7] <a href="/en/squads/b4de690d/2017-2018/Freiburg-Women-Stats">Freiburg</a>
 [8] <a href="/en/players/3cd04ba1/Lina-Magull">Lina Magull</a>
 [9] <a href="/en/players/82c4f339/Laura-Benkarth">Laura Benkarth</a>
[10] <a href="/en/squads/de550500/2017-2018/Turbine-Potsdam-Stats">Turbine Po ...
[11] <a href="/en/players/8b5f141c/Svenja-Huth">Svenja Huth</a>
[12] <a href="/en/players/38bbb38c/Lisa-Schmitz">Lisa Schmitz</a>
[13] <a href="/en/squads/becc1dd0/2017-2018/Essen-Stats">Essen</a>
[14] <a href="/en/players/5a20e7f0/Linda-Dallmann">Linda Dallmann</a>
[15] <a href="/en/players/8699d87d/Lisa-Weiss">Lisa Weiß</a>
[16] <a href="/en/squads/77d2e598/2017-2018/FFC-Frankfurt-Stats">FFC Frankfur ...
[17] <a href="/en/players/6698c9f0/Jackie-Groenen">Jackie Groenen</a>
[18] <a href="/en/players/0dfe3e98/Bryane-Heaberlin">Bryane Heaberlin</a>
[19] <a href="/en/squads/0cc34cf4/2017-2018/Sand-Stats">Sand</a>
[20] <a href="/en/players/0f55e5ea/Nina-Burger">Nina Burger</a>
...
  • Within <a>, we can grab the href attribute with html_attr()

    • href indicates the URL/page associated with the link
fb_node |> 
  html_elements("a") |> 
  html_attr("href")
 [1] "/en/squads/a1393014/2017-2018/Wolfsburg-Women-Stats"    
 [2] "/en/players/363b99a4/Pernille-Harder"                   
 [3] "/en/players/992b30a1/Almuth-Schult"                     
 [4] "/en/squads/51ec22be/2017-2018/Bayern-Munich-Women-Stats"
 [5] "/en/players/6862731d/Fridolina-Rolfo"                   
 [6] "/en/players/c7dc2a33/Manuela-Zinsberger"                
 [7] "/en/squads/b4de690d/2017-2018/Freiburg-Women-Stats"     
 [8] "/en/players/3cd04ba1/Lina-Magull"                       
 [9] "/en/players/82c4f339/Laura-Benkarth"                    
[10] "/en/squads/de550500/2017-2018/Turbine-Potsdam-Stats"    
[11] "/en/players/8b5f141c/Svenja-Huth"                       
[12] "/en/players/38bbb38c/Lisa-Schmitz"                      
[13] "/en/squads/becc1dd0/2017-2018/Essen-Stats"              
[14] "/en/players/5a20e7f0/Linda-Dallmann"                    
[15] "/en/players/8699d87d/Lisa-Weiss"                        
[16] "/en/squads/77d2e598/2017-2018/FFC-Frankfurt-Stats"      
[17] "/en/players/6698c9f0/Jackie-Groenen"                    
[18] "/en/players/0dfe3e98/Bryane-Heaberlin"                  
[19] "/en/squads/0cc34cf4/2017-2018/Sand-Stats"               
[20] "/en/players/0f55e5ea/Nina-Burger"                       
[21] "/en/players/9474cd93/Carina-Schluter"                   
[22] "/en/squads/87705c62/2017-2018/Hoffenheim-Women-Stats"   
[23] "/en/players/9e30ae90/Isabella-Hartig"                   
[24] "/en/players/277cdd4e/Tabea-Wassmuth"                    
[25] "/en/players/7c51bea4/Friederike-Abt"                    
[26] "/en/squads/0580d9a9/2017-2018/MSV-Duisburg-Women-Stats" 
[27] "/en/players/88be98d1/Kathleen-Radtke"                   
[28] "/en/players/781a82de/Lena-Nuding"                       
[29] "/en/squads/7adbf480/2017-2018/Werder-Bremen-Women-Stats"
[30] "/en/players/e070cdf6/Nina-Luhrssen"                     
[31] "/en/players/e18dc3a3/Nora-Clausen"                      
[32] "/en/players/37469b31/Anneke-Borbe"                      
[33] "/en/squads/88ddc98e/2017-2018/Koln-Women-Stats"         
[34] "/en/players/998749f3/Amber-Hearn"                       
[35] "/en/players/9abeb65b/Anne-Kathrine-Kremer"              
[36] "/en/squads/765472c9/2017-2018/USV-Jena-Stats"           
[37] "/en/players/b004aab5/Amelia-Pietrangelo"                
[38] "/en/players/90c69bf5/Justien-Odeurs"                    
  • Notice that the previous output is a vector with all hyperlinks in the table, including all squads and players

  • Since we want squads only, we need to subset out all strings with the keyword "squads"

  • The function str_subset() comes in handy here

    • str_subset() returns only the vector elements that match a pattern
fb_links <- fb_node |> 
  html_elements("a") |> 
  html_attr("href") |> 
  str_subset("squads")
fb_links
 [1] "/en/squads/a1393014/2017-2018/Wolfsburg-Women-Stats"    
 [2] "/en/squads/51ec22be/2017-2018/Bayern-Munich-Women-Stats"
 [3] "/en/squads/b4de690d/2017-2018/Freiburg-Women-Stats"     
 [4] "/en/squads/de550500/2017-2018/Turbine-Potsdam-Stats"    
 [5] "/en/squads/becc1dd0/2017-2018/Essen-Stats"              
 [6] "/en/squads/77d2e598/2017-2018/FFC-Frankfurt-Stats"      
 [7] "/en/squads/0cc34cf4/2017-2018/Sand-Stats"               
 [8] "/en/squads/87705c62/2017-2018/Hoffenheim-Women-Stats"   
 [9] "/en/squads/0580d9a9/2017-2018/MSV-Duisburg-Women-Stats" 
[10] "/en/squads/7adbf480/2017-2018/Werder-Bremen-Women-Stats"
[11] "/en/squads/88ddc98e/2017-2018/Koln-Women-Stats"         
[12] "/en/squads/765472c9/2017-2018/USV-Jena-Stats"           
  • Finally, we can add the team images and links as two new columns in our table
fb_tbl <- fb_node |> 
  html_table() |> 
  mutate(img = fb_imgs,
         link = fb_links)

Let’s make a scatterplot of goals scored (GF, goals for) against goals conceded (GA, goals against), and display the team logos.

library(ggimage)
fb_tbl |>
  mutate(img = str_remove(img, "mini.")) |> 
  ggplot(aes(GA, GF)) +
  geom_image(aes(image = img), size = 0.08, asp = 1) +
  theme_classic()

Scraping text

Scraping text

  • Just as before, after reading in the page, we inspect and grab the CSS selector for only this blurb of text
wimbledon_url <- "https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles" 
wimbledon_url |> 
  read_html() |> 
  html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)")
{html_node}
<div class="div-col">
[1] <dl>\n<dd>\n<span style="visibility:hidden;color:transparent;">0</span><a ...
[2] <dl>\n<dd>\n<a href="#Section_1">17</a>.   <span class="flagicon"><span c ...

Scraping text

  • Then, we can retrieve text from this element with html_text2()
wimbledon_url |> 
  read_html() |> 
  html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)") |> 
  html_text2()
[1] "01. Dinara Safina (semifinals)\n02. Serena Williams (champion)\n03. Venus Williams (final)\n04. Elena Dementieva (semifinals)\n05. Svetlana Kuznetsova (third round)\n06. Jelena Janković (third round)\n07. Vera Zvonareva (third round, withdrew due to an ankle injury)\n08. Victoria Azarenka (quarterfinals)\n09. Caroline Wozniacki (fourth round)\n10. Nadia Petrova (fourth round)\n11. Agnieszka Radwańska (quarterfinals)\n12. Marion Bartoli (third round)\n13. Ana Ivanovic (fourth round, retired due to a thigh injury)\n14. Dominika Cibulková (third round)\n15. Flavia Pennetta (third round)\n16. Zheng Jie (second round)\n17. Amélie Mauresmo (fourth round)\n18. Samantha Stosur (third round)\n19. Li Na (third round)\n20. Anabel Medina Garrigues (third round)\n21. Patty Schnyder (first round)\n22. Alizé Cornet (first round)\n23. Aleksandra Wozniak (first round)\n24. Maria Sharapova (second round)\n25. Kaia Kanepi (first round)\n26. Virginie Razzano (fourth round)\n27. Alisa Kleybanova (second round)\n28. Sorana Cîrstea (third round)\n29. Sybille Bammer (first round)\n30. Ágnes Szávay (first round)\n31. Anastasia Pavlyuchenkova (second round)\n32. Anna Chakvetadze (first round)"

Scraping text

  • Notice this outputs a single string of all the text

  • Each combination of seed-player-result is separated by a newline character \n

  • There are many ways to separate these; one is with str_split_1()

    • str_split_1() splits a single string into a character vector based on a pattern

    • Other ways include: str_split() then unlist(), or read_lines(), and many more
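For comparison, a base-R sketch of the same split via strsplit() then unlist() (the toy string below is hypothetical, standing in for the scraped text):

```r
txt <- "01. Dinara Safina (semifinals)\n02. Serena Williams (champion)"

# strsplit() returns a list (one element per input string);
# unlist() flattens it into a character vector, like str_split_1()
unlist(strsplit(txt, "\n"))
```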

Scraping text

wimbledon_info <- wimbledon_url |> 
  read_html() |> 
  html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)") |> 
  html_text2() |>
  str_split_1("\\n")
wimbledon_info
 [1] "01. Dinara Safina (semifinals)"                                   
 [2] "02. Serena Williams (champion)"                                   
 [3] "03. Venus Williams (final)"                                       
 [4] "04. Elena Dementieva (semifinals)"                                
 [5] "05. Svetlana Kuznetsova (third round)"                            
 [6] "06. Jelena Janković (third round)"                                
 [7] "07. Vera Zvonareva (third round, withdrew due to an ankle injury)"
 [8] "08. Victoria Azarenka (quarterfinals)"                            
 [9] "09. Caroline Wozniacki (fourth round)"                            
[10] "10. Nadia Petrova (fourth round)"                                 
[11] "11. Agnieszka Radwańska (quarterfinals)"                          
[12] "12. Marion Bartoli (third round)"                                 
[13] "13. Ana Ivanovic (fourth round, retired due to a thigh injury)"   
[14] "14. Dominika Cibulková (third round)"                             
[15] "15. Flavia Pennetta (third round)"                                
[16] "16. Zheng Jie (second round)"                                     
[17] "17. Amélie Mauresmo (fourth round)"                               
[18] "18. Samantha Stosur (third round)"                                
[19] "19. Li Na (third round)"                                          
[20] "20. Anabel Medina Garrigues (third round)"                        
[21] "21. Patty Schnyder (first round)"                                 
[22] "22. Alizé Cornet (first round)"                                   
[23] "23. Aleksandra Wozniak (first round)"                             
[24] "24. Maria Sharapova (second round)"                               
[25] "25. Kaia Kanepi (first round)"                                    
[26] "26. Virginie Razzano (fourth round)"                              
[27] "27. Alisa Kleybanova (second round)"                              
[28] "28. Sorana Cîrstea (third round)"                                 
[29] "29. Sybille Bammer (first round)"                                 
[30] "30. Ágnes Szávay (first round)"                                   
[31] "31. Anastasia Pavlyuchenkova (second round)"                      
[32] "32. Anna Chakvetadze (first round)"                               

Scraping text

  • As a final step, you can try to turn the vector into a table (as a single column), then clean it up to create three columns: seed, player, and result

(This might involve tasks like extracting text between parentheses, locating special characters like . ( ), etc. — check out this blog post)
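One possible sketch of that cleanup in base R (the regex patterns below are assumptions based on the "seed. player (result)" format; a stringr version with str_match() would look similar):

```r
x <- "02. Serena Williams (champion)"  # toy string in the slides' format

# capture the digits before the dot
seed   <- sub("^(\\d+)\\..*$", "\\1", x)
# capture the text between the dot and the opening parenthesis
player <- sub("^\\d+\\.\\s*(.*)\\s+\\(.*\\)$", "\\1", x)
# capture the text between the parentheses
result <- sub("^.*\\((.*)\\)$", "\\1", x)

c(seed, player, result)
```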

(Web) API basics

An API (Application Programming Interface) connects computer programs to each other

Web APIs provide interactions between a client device and a web server using the Hypertext Transfer Protocol (HTTP)

  • Clients send an (HTTP) request and receive a response (typically in JSON or XML)

  • Many organizations have their own public API, which can be used to access data

  • Fortunately, there exist many R packages (sports and non-sports) that provide access to APIs for obtaining data

The httr package

httr offers a general way of getting data from an API, via different tools for working with HTTP

  • GET() sends a request to an API and captures the response

  • content() extracts the data from the response

These two functions are illustrated in the next example

There are many other useful functions in httr

  • For example, PUT() and POST() can be used to send data to APIs

  • Other popular verbs are PATCH(), HEAD(), and DELETE()

Pulling data from APIs

Example: Formula One API (Inspiration: Tidy Tuesday 2021-09-07)

Pulling data from APIs

  • First, we can use GET() to send a request to the API. We then receive the data via a response
library(httr)
f1_api <- "http://ergast.com/api/f1/constructorStandings/1/constructors.json"
f1_response <- f1_api |> 
  GET()
f1_response
Response [http://ergast.com/api/f1/constructorStandings/1/constructors.json]
  Date: 2023-07-20 14:18
  Status: 200
  Content-Type: application/json; charset=utf-8
  Size: 2.4 kB
# check the type of the response object and whether we get an error
# http_type(f1_response)
# http_error(f1_response)

Pulling data from APIs

  • Next, we want to get the data from the response by calling content(). We can then view the structure of the content object.
f1_content <- f1_response |>   
  content()
glimpse(f1_content)
List of 1
 $ MRData:List of 7
  ..$ xmlns           : chr "http://ergast.com/mrd/1.5"
  ..$ series          : chr "f1"
  ..$ url             : chr "http://ergast.com/api/f1/constructorstandings/1/constructors.json"
  ..$ limit           : chr "30"
  ..$ offset          : chr "0"
  ..$ total           : chr "17"
  ..$ ConstructorTable:List of 2
  .. ..$ constructorStandings: chr "1"
  .. ..$ Constructors        :List of 17

Pulling data from APIs

  • Finally, based on the content structure, we can get a list of constructors. Each list consists of constructor ID, URL, name, and nationality.
f1_constructor_list <- f1_content |> 
  pluck("MRData") |> 
  pluck("ConstructorTable") |> 
  pluck("Constructors")
# f1_constructor_list
f1_constructor_list[[1]]
$constructorId
[1] "benetton"

$url
[1] "http://en.wikipedia.org/wiki/Benetton_Formula"

$name
[1] "Benetton"

$nationality
[1] "Italian"
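A small simplification: pluck() accepts multiple accessors, so the three calls above collapse into one.

```r
library(purrr)

# Equivalent to chaining pluck() three times
f1_constructor_list <- f1_content |> 
  pluck("MRData", "ConstructorTable", "Constructors")
```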

Pulling data from APIs

  • A few extra transformation steps will give us the desired table
f1_constructor_tbl <- f1_constructor_list |> 
  as_tibble_col(column_name = "info") |> # convert list to tibble
  unnest_wider(info) # unnest a list-column into columns
f1_constructor_tbl
# A tibble: 17 × 4
   constructorId url                                           name  nationality
   <chr>         <chr>                                         <chr> <chr>      
 1 benetton      http://en.wikipedia.org/wiki/Benetton_Formula Bene… Italian    
 2 brabham-repco http://en.wikipedia.org/wiki/Brabham          Brab… British    
 3 brawn         http://en.wikipedia.org/wiki/Brawn_GP         Brawn British    
 4 brm           http://en.wikipedia.org/wiki/BRM              BRM   British    
 5 cooper-climax http://en.wikipedia.org/wiki/Cooper_Car_Comp… Coop… British    
 6 ferrari       http://en.wikipedia.org/wiki/Scuderia_Ferrari Ferr… Italian    
 7 lotus-climax  http://en.wikipedia.org/wiki/Team_Lotus       Lotu… British    
 8 lotus-ford    http://en.wikipedia.org/wiki/Team_Lotus       Lotu… British    
 9 matra-ford    http://en.wikipedia.org/wiki/Matra            Matr… French     
10 mclaren       http://en.wikipedia.org/wiki/McLaren          McLa… British    
11 mercedes      http://en.wikipedia.org/wiki/Mercedes-Benz_i… Merc… German     
12 red_bull      http://en.wikipedia.org/wiki/Red_Bull_Racing  Red … Austrian   
13 renault       http://en.wikipedia.org/wiki/Renault_in_Form… Rena… French     
14 team_lotus    http://en.wikipedia.org/wiki/Team_Lotus       Team… British    
15 tyrrell       http://en.wikipedia.org/wiki/Tyrrell_Racing   Tyrr… British    
16 vanwall       http://en.wikipedia.org/wiki/Vanwall          Vanw… British    
17 williams      http://en.wikipedia.org/wiki/Williams_Grand_… Will… British    
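An equivalent purrr-based sketch of the same transformation (assumes purrr >= 1.0.0 for list_rbind()):

```r
library(purrr)
library(tibble)

# Each constructor is a named list of length-one elements, so it converts
# cleanly to a one-row tibble; list_rbind() then stacks the rows
f1_constructor_tbl <- f1_constructor_list |> 
  map(as_tibble) |> 
  list_rbind()
```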

Scrape responsibly!

The polite package: overview

install.packages("polite")

polite ensures that you respect the website's robots.txt and don't submit too many requests

  • bow() introduces the user to the host and asks for scraping permission

  • scrape() scrapes and retrieves data

(Sometimes, nod() is required as an intermediate step, to agree with the host on a modification of the session path)
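A sketch of that intermediate step: bow to the host once, then nod() to agree on a specific path before scraping.

```r
library(polite)

# Bow to the host once, then nod to a specific path on that host
session <- bow("https://en.wikipedia.org/")
session <- nod(session, path = "wiki/2009_Wimbledon_Championships_-_Women's_singles")
# result <- scrape(session)
```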

The polite package: example

  • Example: Wimbledon Women’s singles (same as before)

  • First, pass the URL into bow() to get a “session” object

    • This gives information about the robots.txt and whether the webpage is scrapable
library(polite)
wimbledon_url <- "https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles"
session <- wimbledon_url |> 
  bow()
session
<polite session> https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles
    User-agent: polite R package
    robots.txt: 456 rules are defined for 33 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

The polite package: example

  • Now, use scrape() to get the data from the session previously created by bow()

    • This essentially replaces read_html() as seen earlier
session |> 
  scrape()
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...

The polite package: example

  • The remaining steps are similar to before. We can reuse the earlier code for selecting the HTML element and retrieving its text.
session |> 
  scrape() |> 
  html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)") |> 
  html_text2() |>
  str_split_1("\\n")
 [1] "01. Dinara Safina (semifinals)"                                   
 [2] "02. Serena Williams (champion)"                                   
 [3] "03. Venus Williams (final)"                                       
 [4] "04. Elena Dementieva (semifinals)"                                
 [5] "05. Svetlana Kuznetsova (third round)"                            
 [6] "06. Jelena Janković (third round)"                                
 [7] "07. Vera Zvonareva (third round, withdrew due to an ankle injury)"
 [8] "08. Victoria Azarenka (quarterfinals)"                            
 [9] "09. Caroline Wozniacki (fourth round)"                            
[10] "10. Nadia Petrova (fourth round)"                                 
[11] "11. Agnieszka Radwańska (quarterfinals)"                          
[12] "12. Marion Bartoli (third round)"                                 
[13] "13. Ana Ivanovic (fourth round, retired due to a thigh injury)"   
[14] "14. Dominika Cibulková (third round)"                             
[15] "15. Flavia Pennetta (third round)"                                
[16] "16. Zheng Jie (second round)"                                     
[17] "17. Amélie Mauresmo (fourth round)"                               
[18] "18. Samantha Stosur (third round)"                                
[19] "19. Li Na (third round)"                                          
[20] "20. Anabel Medina Garrigues (third round)"                        
[21] "21. Patty Schnyder (first round)"                                 
[22] "22. Alizé Cornet (first round)"                                   
[23] "23. Aleksandra Wozniak (first round)"                             
[24] "24. Maria Sharapova (second round)"                               
[25] "25. Kaia Kanepi (first round)"                                    
[26] "26. Virginie Razzano (fourth round)"                              
[27] "27. Alisa Kleybanova (second round)"                              
[28] "28. Sorana Cîrstea (third round)"                                 
[29] "29. Sybille Bammer (first round)"                                 
[30] "30. Ágnes Szávay (first round)"                                   
[31] "31. Anastasia Pavlyuchenkova (second round)"                      
[32] "32. Anna Chakvetadze (first round)"                               
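As a possible follow-up, the "seed. name (result)" strings above could be split into columns with tidyr. This is a sketch: `seeds` is assumed to hold the character vector produced by the pipeline above.

```r
library(tibble)
library(tidyr)

# `seeds` is assumed to be the character vector of seed strings from above
seeds_tbl <- tibble(raw = seeds) |> 
  extract(
    raw,
    into  = c("seed", "player", "result"),
    regex = "(\\d+)\\.\\s+(.*?)\\s+\\((.*)\\)"
  )
```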

More resources

Final words

  • Web scraping is an excellent means for gaining proficiency in data cleaning

    • It takes time; the more you practice, the better you get

    • Inspect the output at each step

    • Consult the help documentation

  • Come up with fun personal projects (data viz, Shiny app, etc.), scrape data, and enjoy learning (and the struggle)

  • You can develop the next great sports “scrapR” package (or one for your own field of interest)