
Department of Statistics & Data Science
Carnegie Mellon University
SURE 2023 - CMSACamp
@qntkhvn qntkhvn qntkhvn.netlify.app
Slides: qntkhvn.github.io/webscraping
Sports analytics today would be so much different without web scraping
Publicly available data
Reproducible research
A breakthrough…
Goal: you should be able to use R to scrape an HTML table from the web after this lecture, and hopefully more.
Agenda:
Webpage basics
Web scraping with rvest
stringr
APIs
Responsible web scraping
Best practices
Featuring polite
Data Wrangling lecture notes from the Stanford Data Challenge Lab (DCL) course
I highly recommend the Web scraping chapter in R4DS (2e) for a neat basic overview of webpage structure and web scraping.
HTML (Hyper Text Markup Language) defines the content and structure of a webpage
An HTML page contains various elements (headers, paragraphs, tables…)
HTML tags define where an element begins and ends
Opening and closing tags have the forms <tagname> and </tagname> (e.g., <table> and </table>)
CSS (Cascading Style Sheets) defines the appearance of HTML elements (i.e. whether the webpage is pretty or ugly)
CSS selectors are patterns used to select the elements to be styled
CSS selectors can be used to extract elements from a webpage
Two widely-used R packages:
rvest: simple, tidyverse friendly, static data
RSelenium: more advanced, dynamic data
We will focus on rvest today
(For the Python fans, you can do it with Beautiful Soup and Selenium)
Data organization and cleaning (most of the process)
Extracting elements (e.g., tables, links, etc.)
Common data manipulation tasks (e.g., with dplyr)
Handling strings (e.g., stringr, stringi) and regular expressions (regex)
Generalization: write functions (and develop packages)
Inspect the output at each step
Consult the help documentation
Example: NHL career games played leaders
Tasks: Scrape the NHL Leaders table
Read the HTML page into R
Grab the CSS selector for the table in the browser
Write scraping code in R
Perform the following data cleaning steps
Create an indicator variable for whether a player is in the Hall of Fame
Remove the asterisk (*) from the Player column
Remove the dot (.) from the Rank column
Use read_html() to read in the HTML page based on a specified webpage’s URL
library(rvest)
library(tidyverse)
nhl_url <- "https://www.hockey-reference.com/leaders/games_played_career.html"
nhl_url |>
read_html()
{html_document}
<html data-version="klecko-" data-root="/home/hr/build" lang="en" class="no-js">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="hr">\n<div id="wrap">\n \n <div id="header" role="banner"> ...
The Developer Tools will open in your browser. Pay attention to the Elements pane (Chrome or Safari) or Inspector pane (Firefox)
The HTML element corresponding to the webpage area (close to the table) mentioned in Step 1 is highlighted
Hovering over different HTML elements will highlight different parts of the webpage
To find the CSS selector for the table, hover over different lines and stop at the line where only the entire table is highlighted
This will be the line with the <table> opening tag
The following video shows how to find the CSS selector (in Chrome)
Use html_element() to get the element associated with the CSS selector (table in this case)
Inside html_element(), specify the CSS selector that we copied earlier
This returns an HTML “node”
{html_node}
<table class="suppress_glossary suppress_csv sortable stats_table" id="stats_career_NHL" data-cols-to-freeze="1,2">
[1] <caption>NHL Leaders Table</caption>
[2] <thead><tr>\n<th class="right">Rank</th>\n<th class="left">Player</th>\n< ...
[3] <tbody>\n<tr>\n<td class="right">1.</td>\n<td class="left"><a href="/play ...
Finally, use html_table() to convert to a tibble (data frame) in R
This completes our scraping process
nhl_tbl <- nhl_url |>
read_html() |>
html_element(css = "#stats_career_NHL") |>
html_table()
nhl_tbl
# A tibble: 250 × 4
Rank Player Years GP
<chr> <chr> <chr> <int>
1 1. Patrick Marleau 1997-21 1779
2 2. Gordie Howe* 1946-80 1767
3 3. Mark Messier* 1979-04 1756
4 4. Jaromír Jágr 1990-18 1733
5 5. Ron Francis* 1981-04 1731
6 6. Joe Thornton 1997-22 1714
7 7. Zdeno Chára 1997-22 1680
8 8. Mark Recchi* 1988-11 1652
9 9. Chris Chelios* 1983-10 1651
10 10. Dave Andreychuk* 1982-06 1639
# ℹ 240 more rows
html_table() only works when the element specified in html_element() is a table
There are other things that can be extracted from an element
To retrieve text from an element, use html_text2()
To retrieve an attribute (e.g., hyperlink) from an element, use html_attr() and html_attrs()
We will touch on these 2 cases later on
The inspection step (for obtaining CSS selector) can be skipped
html_table() can be called right after read_html()
This outputs a list of all the tables that exist on the webpage
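A sketch of this shortcut (reusing nhl_url from earlier):

```r
library(rvest)

# Skip html_element() entirely: html_table() on the full document
# returns a list of tibbles, one per table on the page
nhl_url |>
  read_html() |>
  html_table()
```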
[[1]]
# A tibble: 250 × 4
Rank Player Years GP
<chr> <chr> <chr> <int>
1 1. Patrick Marleau 1997-21 1779
2 2. Gordie Howe* 1946-80 1767
3 3. Mark Messier* 1979-04 1756
4 4. Jaromír Jágr 1990-18 1733
5 5. Ron Francis* 1981-04 1731
6 6. Joe Thornton 1997-22 1714
7 7. Zdeno Chára 1997-22 1680
8 8. Mark Recchi* 1988-11 1652
9 9. Chris Chelios* 1983-10 1651
10 10. Dave Andreychuk* 1982-06 1639
# ℹ 240 more rows
[[2]]
# A tibble: 50 × 4
Rank Player Years GP
<chr> <chr> <chr> <int>
1 1. André Lacroix 1972-79 551
2 2. Ron Plumb 1972-79 549
3 3. Paul Shmyr 1972-79 511
4 4. Michel Parizeau 1972-79 509
5 5. Mike Antonovich 1972-79 486
6 6. Rick Ley 1972-79 478
7 7. John McKenzie 1972-79 477
8 8. Blair MacDonald 1973-79 476
9 9. Larry Pleau 1972-79 468
10 10. Poul Popiel 1972-78 467
# ℹ 40 more rows
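To pull out just the NHL leaders table from this list, index into the first element (a sketch; the name nhl_tables is a hypothetical choice here):

```r
# Store the list of tables, then grab the first one (the NHL table)
nhl_tables <- nhl_url |>
  read_html() |>
  html_table()

nhl_tables[[1]]
```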
# A tibble: 250 × 4
Rank Player Years GP
<chr> <chr> <chr> <int>
1 1. Patrick Marleau 1997-21 1779
2 2. Gordie Howe* 1946-80 1767
3 3. Mark Messier* 1979-04 1756
4 4. Jaromír Jágr 1990-18 1733
5 5. Ron Francis* 1981-04 1731
6 6. Joe Thornton 1997-22 1714
7 7. Zdeno Chára 1997-22 1680
8 8. Mark Recchi* 1988-11 1652
9 9. Chris Chelios* 1983-10 1651
10 10. Dave Andreychuk* 1982-06 1639
# ℹ 240 more rows
The tidyverse offers stringr for string manipulation.
Check out the stringr cheatsheet.
The second page gives a neat overview of regular expressions (special patterns for string matching)
Note that some characters in an R string must be represented as special characters
$ * + . ? [ ] ^ { } | ( ) \
Use a double backslash (\\) to “escape” these characters (e.g., \\*)
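As a quick illustration of escaping (not from the slides), detecting a literal asterisk with str_detect():

```r
library(stringr)

# "\\*" matches a literal asterisk; a bare "*" would be
# interpreted as a regex quantifier and error out
str_detect(c("Gordie Howe*", "Patrick Marleau"), "\\*")
#> [1]  TRUE FALSE
```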
str_detect() can be used with filter() or within a conditional statement (e.g., with ifelse() or case_when())
filter() subsets out rows that satisfy a condition
Suppose we want to keep only the HOF players. We can detect all rows with an asterisk (*) using str_detect().
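A sketch of the filtering step that produces the output below (assuming nhl_tbl from earlier):

```r
# Keep only rows whose Player value contains a literal asterisk
nhl_tbl |>
  filter(str_detect(Player, "\\*"))
```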
# A tibble: 78 × 4
Rank Player Years GP
<chr> <chr> <chr> <int>
1 2. Gordie Howe* 1946-80 1767
2 3. Mark Messier* 1979-04 1756
3 5. Ron Francis* 1981-04 1731
4 8. Mark Recchi* 1988-11 1652
5 9. Chris Chelios* 1983-10 1651
6 10. Dave Andreychuk* 1982-06 1639
7 11. Scott Stevens* 1982-04 1635
8 12. Larry Murphy* 1980-01 1615
9 13. Ray Bourque* 1979-01 1612
10 14. Nicklas Lidström* 1991-12 1564
# ℹ 68 more rows
Recall that one of the data cleaning tasks is to create an indicator variable for whether a player is in the HOF
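A sketch of the step producing the output below (reusing nhl_tbl):

```r
# HOF = 1 if the Player value contains an asterisk, 0 otherwise
nhl_tbl |>
  mutate(HOF = ifelse(str_detect(Player, "\\*"), 1, 0))
```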
# A tibble: 250 × 5
Rank Player Years GP HOF
<chr> <chr> <chr> <int> <dbl>
1 1. Patrick Marleau 1997-21 1779 0
2 2. Gordie Howe* 1946-80 1767 1
3 3. Mark Messier* 1979-04 1756 1
4 4. Jaromír Jágr 1990-18 1733 0
5 5. Ron Francis* 1981-04 1731 1
6 6. Joe Thornton 1997-22 1714 0
7 7. Zdeno Chára 1997-22 1680 0
8 8. Mark Recchi* 1988-11 1652 1
9 9. Chris Chelios* 1983-10 1651 1
10 10. Dave Andreychuk* 1982-06 1639 1
# ℹ 240 more rows
str_remove() takes in a string and removes a specified pattern. It can be used with mutate().
mutate() creates/modifies variables that are functions of existing variables
Now we build upon the previous code to finish the data cleaning process.
Recall that we want to remove the asterisk (*) and dot (.) from the Player and Rank columns, respectively.
nhl_tbl_cleaned <- nhl_tbl |>
mutate(HOF = ifelse(str_detect(Player, "\\*"), 1, 0),
Player = str_remove(Player, "\\*"),
Rank = str_remove(Rank, "\\."))
nhl_tbl_cleaned
# A tibble: 250 × 5
Rank Player Years GP HOF
<chr> <chr> <chr> <int> <dbl>
1 1 Patrick Marleau 1997-21 1779 0
2 2 Gordie Howe 1946-80 1767 1
3 3 Mark Messier 1979-04 1756 1
4 4 Jaromír Jágr 1990-18 1733 0
5 5 Ron Francis 1981-04 1731 1
6 6 Joe Thornton 1997-22 1714 0
7 7 Zdeno Chára 1997-22 1680 0
8 8 Mark Recchi 1988-11 1652 1
9 9 Chris Chelios 1983-10 1651 1
10 10 Dave Andreychuk 1982-06 1639 1
# ℹ 240 more rows
Example: Frauen Bundesliga (German women’s soccer league)
URL: https://fbref.com/en/comps/183/2017-2018/2017-2018-Frauen-Bundesliga-Stats
The link above provides stats for the 2017-2018 season
Scrape the Overall table under Regular season
(Time permitting) Write a general function for scraping data for any specified season.
Hint: change the years in the URL and CSS selector
Get data for every season between 2016-2017 and 2019-2020 and combine them into a single table
fb_url <- "https://fbref.com/en/comps/183/2017-2018/2017-2018-Frauen-Bundesliga-Stats"
fb_tbl <- fb_url |>
read_html() |>
html_element(css = "#results2017-20181831_overall") |>
html_table()
fb_tbl
# A tibble: 12 × 15
Rk Squad MP W D L GF GA GD Pts `Pts/MP`
<int> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
1 1 Wolfsburg 22 18 2 2 56 8 48 56 2.55
2 2 Bayern Munich 22 17 2 3 62 15 47 53 2.41
3 3 Freiburg 22 15 3 4 50 15 35 48 2.18
4 4 Turbine Potsd… 22 13 6 3 50 21 29 45 2.05
5 5 Essen 22 12 3 7 43 30 13 39 1.77
6 6 FFC Frankfurt 22 10 1 11 29 25 4 31 1.41
7 7 Sand 22 9 3 10 32 34 -2 30 1.36
8 8 Hoffenheim 22 8 1 13 22 32 -10 25 1.14
9 9 MSV Duisburg 22 6 0 16 16 33 -17 18 0.82
10 10 Werder Bremen 22 3 5 14 26 59 -33 14 0.64
11 11 Köln 22 3 2 17 8 78 -70 11 0.5
12 12 USV Jena 22 2 4 16 12 56 -44 10 0.45
# ℹ 4 more variables: Attendance <chr>, `Top Team Scorer` <chr>,
# Goalkeeper <chr>, Notes <chr>
get_fb_data <- function(start_year) {
year_str <- str_c(start_year, start_year + 1, sep = "-")
fb_url <- str_c("https://fbref.com/en/comps/183/", year_str, "/", year_str, "-Frauen-Bundesliga-Stats")
year_css <- str_c("#results", year_str, "1831_overall")
fb_tbl <- fb_url |>
read_html() |>
html_element(css = year_css) |>
html_table() |>
mutate(season = year_str)
return(fb_tbl)
}
seasons <- 2016:2019
fb_tbl_full <- seasons |>
map(get_fb_data) |>
list_rbind()
# fb_tbl_full
Continuing with the Frauen Bundesliga 2017-2018 season stats example…
Notice that the table on the webpage also contains team logos, URLs, etc.
This information was not extracted with html_table()
Suppose we’re also interested in getting the team URLs and logos
We first store the HTML node for the table in an object (i.e., everything up to html_element() with a specified CSS selector for the table)
This can then be used to obtain the images and team links based on their tags.
fb_url <- "https://fbref.com/en/comps/183/2017-2018/2017-2018-Frauen-Bundesliga-Stats"
fb_node <- fb_url |>
read_html() |>
html_element(css = "#results2017-20181831_overall")
fb_node
{html_node}
<table class="stats_table sortable min_width force_mobilize" id="results2017-20181831_overall" data-cols-to-freeze=",2">
[1] <caption>Regular season Table</caption>
[2] <colgroup>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col>\n<col> ...
[3] <thead><tr>\n<th aria-label="Rank" data-stat="rank" scope="col" class=" p ...
[4] <tbody>\n<tr>\n<th scope="row" class="right qualifier qualification_indic ...
To get all image elements, we can use html_elements() and specify the img tag
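A sketch of this call (reusing the fb_node object stored above):

```r
# Select every <img> element inside the table node
fb_node |>
  html_elements("img")
```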
html_elements() is the “plural” version of html_element(), since we want ALL image elements, not just one (honestly, if you don’t remember the difference, just try both and see which one is suitable)
{xml_nodeset (12)}
[1] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[2] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[3] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[4] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[5] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[6] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[7] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[8] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[9] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[10] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[11] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
[12] <img itemscope="image" height="13" width="13" src="https://cdn.ssref.net ...
Use html_attr() to get the src attribute
[1] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.a1393014.png"
[2] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.51ec22be.png"
[3] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.b4de690d.png"
[4] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.de550500.png"
[5] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.becc1dd0.png"
[6] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.77d2e598.png"
[7] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.0cc34cf4.png"
[8] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.87705c62.png"
[9] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.0580d9a9.png"
[10] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.7adbf480.png"
[11] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.88ddc98e.png"
[12] "https://cdn.ssref.net/req/202307191/tlogo/fb/mini.765472c9.png"
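A call along these lines (reusing fb_node) produces the vector of logo URLs above:

```r
# Pull the src attribute from each <img> element
fb_node |>
  html_elements("img") |>
  html_attr("src")
```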
To get all the URLs in the Squad column, we can first use html_elements() again and specify the "a" tag
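A sketch of this step (reusing fb_node):

```r
# Select every <a> (anchor) element inside the table node
fb_node |>
  html_elements("a")
```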
The <a> (anchor) tag defines a hyperlink
{xml_nodeset (38)}
[1] <a href="/en/squads/a1393014/2017-2018/Wolfsburg-Women-Stats">Wolfsburg</a>
[2] <a href="/en/players/363b99a4/Pernille-Harder">Pernille Harder</a>
[3] <a href="/en/players/992b30a1/Almuth-Schult">Almuth Schult</a>
[4] <a href="/en/squads/51ec22be/2017-2018/Bayern-Munich-Women-Stats">Bayern ...
[5] <a href="/en/players/6862731d/Fridolina-Rolfo">Fridolina Rolfö</a>
[6] <a href="/en/players/c7dc2a33/Manuela-Zinsberger">Manuela Zinsberger</a>
[7] <a href="/en/squads/b4de690d/2017-2018/Freiburg-Women-Stats">Freiburg</a>
[8] <a href="/en/players/3cd04ba1/Lina-Magull">Lina Magull</a>
[9] <a href="/en/players/82c4f339/Laura-Benkarth">Laura Benkarth</a>
[10] <a href="/en/squads/de550500/2017-2018/Turbine-Potsdam-Stats">Turbine Po ...
[11] <a href="/en/players/8b5f141c/Svenja-Huth">Svenja Huth</a>
[12] <a href="/en/players/38bbb38c/Lisa-Schmitz">Lisa Schmitz</a>
[13] <a href="/en/squads/becc1dd0/2017-2018/Essen-Stats">Essen</a>
[14] <a href="/en/players/5a20e7f0/Linda-Dallmann">Linda Dallmann</a>
[15] <a href="/en/players/8699d87d/Lisa-Weiss">Lisa Weiß</a>
[16] <a href="/en/squads/77d2e598/2017-2018/FFC-Frankfurt-Stats">FFC Frankfur ...
[17] <a href="/en/players/6698c9f0/Jackie-Groenen">Jackie Groenen</a>
[18] <a href="/en/players/0dfe3e98/Bryane-Heaberlin">Bryane Heaberlin</a>
[19] <a href="/en/squads/0cc34cf4/2017-2018/Sand-Stats">Sand</a>
[20] <a href="/en/players/0f55e5ea/Nina-Burger">Nina Burger</a>
...
Within <a>, we can grab the href attribute with html_attr()
href indicates the URL/page associated with the link
[1] "/en/squads/a1393014/2017-2018/Wolfsburg-Women-Stats"
[2] "/en/players/363b99a4/Pernille-Harder"
[3] "/en/players/992b30a1/Almuth-Schult"
[4] "/en/squads/51ec22be/2017-2018/Bayern-Munich-Women-Stats"
[5] "/en/players/6862731d/Fridolina-Rolfo"
[6] "/en/players/c7dc2a33/Manuela-Zinsberger"
[7] "/en/squads/b4de690d/2017-2018/Freiburg-Women-Stats"
[8] "/en/players/3cd04ba1/Lina-Magull"
[9] "/en/players/82c4f339/Laura-Benkarth"
[10] "/en/squads/de550500/2017-2018/Turbine-Potsdam-Stats"
[11] "/en/players/8b5f141c/Svenja-Huth"
[12] "/en/players/38bbb38c/Lisa-Schmitz"
[13] "/en/squads/becc1dd0/2017-2018/Essen-Stats"
[14] "/en/players/5a20e7f0/Linda-Dallmann"
[15] "/en/players/8699d87d/Lisa-Weiss"
[16] "/en/squads/77d2e598/2017-2018/FFC-Frankfurt-Stats"
[17] "/en/players/6698c9f0/Jackie-Groenen"
[18] "/en/players/0dfe3e98/Bryane-Heaberlin"
[19] "/en/squads/0cc34cf4/2017-2018/Sand-Stats"
[20] "/en/players/0f55e5ea/Nina-Burger"
[21] "/en/players/9474cd93/Carina-Schluter"
[22] "/en/squads/87705c62/2017-2018/Hoffenheim-Women-Stats"
[23] "/en/players/9e30ae90/Isabella-Hartig"
[24] "/en/players/277cdd4e/Tabea-Wassmuth"
[25] "/en/players/7c51bea4/Friederike-Abt"
[26] "/en/squads/0580d9a9/2017-2018/MSV-Duisburg-Women-Stats"
[27] "/en/players/88be98d1/Kathleen-Radtke"
[28] "/en/players/781a82de/Lena-Nuding"
[29] "/en/squads/7adbf480/2017-2018/Werder-Bremen-Women-Stats"
[30] "/en/players/e070cdf6/Nina-Luhrssen"
[31] "/en/players/e18dc3a3/Nora-Clausen"
[32] "/en/players/37469b31/Anneke-Borbe"
[33] "/en/squads/88ddc98e/2017-2018/Koln-Women-Stats"
[34] "/en/players/998749f3/Amber-Hearn"
[35] "/en/players/9abeb65b/Anne-Kathrine-Kremer"
[36] "/en/squads/765472c9/2017-2018/USV-Jena-Stats"
[37] "/en/players/b004aab5/Amelia-Pietrangelo"
[38] "/en/players/90c69bf5/Justien-Odeurs"
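A sketch of the call producing the vector above, stored under the hypothetical name fb_links:

```r
# Pull the href attribute from each <a> element
fb_links <- fb_node |>
  html_elements("a") |>
  html_attr("href")
```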
Notice that the previous output is a vector with all hyperlinks in the table, including all squads and players
Since we want squads only, we need to subset out all strings with the keyword "squads"
The function str_subset() comes in handy here
str_subset() returns only the vector elements that match a pattern
[1] "/en/squads/a1393014/2017-2018/Wolfsburg-Women-Stats"
[2] "/en/squads/51ec22be/2017-2018/Bayern-Munich-Women-Stats"
[3] "/en/squads/b4de690d/2017-2018/Freiburg-Women-Stats"
[4] "/en/squads/de550500/2017-2018/Turbine-Potsdam-Stats"
[5] "/en/squads/becc1dd0/2017-2018/Essen-Stats"
[6] "/en/squads/77d2e598/2017-2018/FFC-Frankfurt-Stats"
[7] "/en/squads/0cc34cf4/2017-2018/Sand-Stats"
[8] "/en/squads/87705c62/2017-2018/Hoffenheim-Women-Stats"
[9] "/en/squads/0580d9a9/2017-2018/MSV-Duisburg-Women-Stats"
[10] "/en/squads/7adbf480/2017-2018/Werder-Bremen-Women-Stats"
[11] "/en/squads/88ddc98e/2017-2018/Koln-Women-Stats"
[12] "/en/squads/765472c9/2017-2018/USV-Jena-Stats"
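A sketch of this subsetting step, assuming the href vector from before is stored under the hypothetical name fb_links:

```r
# Keep only the links containing the keyword "squads"
fb_links |>
  str_subset("squads")
```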
Let’s make a scatterplot of total number of goals scored (GF - goals for) and goals conceded (GA - goals against), and display the team logos.
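A sketch of such a plot, assuming fb_tbl has been augmented with a hypothetical logo_url column holding the scraped image links (geom_image() is from the ggimage package):

```r
library(ggplot2)

# Draw each team's logo at its (GF, GA) coordinates
fb_tbl |>
  ggplot(aes(x = GF, y = GA)) +
  ggimage::geom_image(aes(image = logo_url), size = 0.08) +
  labs(x = "Goals for (GF)", y = "Goals against (GA)")
```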
As previously mentioned, data do not always come in the form of nicely formatted tables
Sometimes data are simply raw text
Example: Wimbledon Women’s singles
URL: https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles
Suppose we want to scrape results for the seeded players (under Seeds section)
wimbledon_url <- "https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles"
wimbledon_url |>
read_html() |>
html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)")
{html_node}
<div class="div-col">
[1] <dl>\n<dd>\n<span style="visibility:hidden;color:transparent;">0</span><a ...
[2] <dl>\n<dd>\n<a href="#Section_1">17</a>. <span class="flagicon"><span c ...
Retrieve the text with html_text2()
wimbledon_url |>
read_html() |>
html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)") |>
html_text2()
[1] "01. Dinara Safina (semifinals)\n02. Serena Williams (champion)\n03. Venus Williams (final)\n04. Elena Dementieva (semifinals)\n05. Svetlana Kuznetsova (third round)\n06. Jelena Janković (third round)\n07. Vera Zvonareva (third round, withdrew due to an ankle injury)\n08. Victoria Azarenka (quarterfinals)\n09. Caroline Wozniacki (fourth round)\n10. Nadia Petrova (fourth round)\n11. Agnieszka Radwańska (quarterfinals)\n12. Marion Bartoli (third round)\n13. Ana Ivanovic (fourth round, retired due to a thigh injury)\n14. Dominika Cibulková (third round)\n15. Flavia Pennetta (third round)\n16. Zheng Jie (second round)\n17. Amélie Mauresmo (fourth round)\n18. Samantha Stosur (third round)\n19. Li Na (third round)\n20. Anabel Medina Garrigues (third round)\n21. Patty Schnyder (first round)\n22. Alizé Cornet (first round)\n23. Aleksandra Wozniak (first round)\n24. Maria Sharapova (second round)\n25. Kaia Kanepi (first round)\n26. Virginie Razzano (fourth round)\n27. Alisa Kleybanova (second round)\n28. Sorana Cîrstea (third round)\n29. Sybille Bammer (first round)\n30. Ágnes Szávay (first round)\n31. Anastasia Pavlyuchenkova (second round)\n32. Anna Chakvetadze (first round)"
Notice this outputs a single string of all the text
Each combination of seed-player-result is separated by a newline character \n
There are many ways to separate these; one way is with str_split_1()
str_split_1() splits a single string into a character vector based on a pattern
Other ways include: str_split() then unlist(), or read_lines(), and many more
wimbledon_info <- wimbledon_url |>
read_html() |>
html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)") |>
html_text2() |>
str_split_1("\\n")
wimbledon_info
 [1] "01. Dinara Safina (semifinals)"
[2] "02. Serena Williams (champion)"
[3] "03. Venus Williams (final)"
[4] "04. Elena Dementieva (semifinals)"
[5] "05. Svetlana Kuznetsova (third round)"
[6] "06. Jelena Janković (third round)"
[7] "07. Vera Zvonareva (third round, withdrew due to an ankle injury)"
[8] "08. Victoria Azarenka (quarterfinals)"
[9] "09. Caroline Wozniacki (fourth round)"
[10] "10. Nadia Petrova (fourth round)"
[11] "11. Agnieszka Radwańska (quarterfinals)"
[12] "12. Marion Bartoli (third round)"
[13] "13. Ana Ivanovic (fourth round, retired due to a thigh injury)"
[14] "14. Dominika Cibulková (third round)"
[15] "15. Flavia Pennetta (third round)"
[16] "16. Zheng Jie (second round)"
[17] "17. Amélie Mauresmo (fourth round)"
[18] "18. Samantha Stosur (third round)"
[19] "19. Li Na (third round)"
[20] "20. Anabel Medina Garrigues (third round)"
[21] "21. Patty Schnyder (first round)"
[22] "22. Alizé Cornet (first round)"
[23] "23. Aleksandra Wozniak (first round)"
[24] "24. Maria Sharapova (second round)"
[25] "25. Kaia Kanepi (first round)"
[26] "26. Virginie Razzano (fourth round)"
[27] "27. Alisa Kleybanova (second round)"
[28] "28. Sorana Cîrstea (third round)"
[29] "29. Sybille Bammer (first round)"
[30] "30. Ágnes Szávay (first round)"
[31] "31. Anastasia Pavlyuchenkova (second round)"
[32] "32. Anna Chakvetadze (first round)"
(This might involve tasks like extracting text between parentheses, locating special characters like . ( ), etc. — check out this blog post)
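As a small illustration (not from the slides), the text between parentheses can be captured with a regex; the example string below is one element of wimbledon_info:

```r
library(stringr)

x <- "02. Serena Williams (champion)"

# "\\(([^)]*)\\)" matches a literal "(...)" pair and captures
# everything inside; column 2 of the result is the capture group
str_match(x, "\\(([^)]*)\\)")[, 2]
#> [1] "champion"
```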
An API (Application Programming Interface) connects computer programs to each other
Web APIs provide interactions between a client device and a web server using the Hypertext Transfer Protocol (HTTP)
Clients send an (HTTP) request and receive a response (in JSON or XML)
Many organizations have their own public API, which can be used to access data
Fortunately, there exist many R packages (sports and non-sports) that provide access to APIs for obtaining data
Note that these packages (“API wrappers”) do not provide the actual data; instead, they provide functions for accessing the data
For sports, check out the Sports Analytics CRAN Task View and SportsDataverse for more information
The httr package offers a general way of getting data from an API, via different tools for working with HTTP
GET() sends a request to an API and captures the response
content() extracts out the data from the response
These 2 functions are illustrated in the next example
There are many other useful functions in httr
For example, PUT() and POST() can be used to send data to APIs
Other popular verbs are PATCH(), HEAD(), and DELETE()
Example: Formula One API (Inspiration: Tidy Tuesday 2021-09-07)
Ergast Developer API (http://ergast.com/mrd/) is an (experimental) API that provides Formula One historical data
Suppose we’re interested in getting a table of every F1 winning constructor
Use GET() to send a request to the API. We then receive the data via a response.
library(httr)
f1_api <- "http://ergast.com/api/f1/constructorStandings/1/constructors.json"
f1_response <- f1_api |>
GET()
f1_response
Response [http://ergast.com/api/f1/constructorStandings/1/constructors.json]
Date: 2023-07-20 14:18
Status: 200
Content-Type: application/json; charset=utf-8
Size: 2.4 kB
Extract the data from the response with content(). We can then view the structure of the content object.
List of 1
$ MRData:List of 7
..$ xmlns : chr "http://ergast.com/mrd/1.5"
..$ series : chr "f1"
..$ url : chr "http://ergast.com/api/f1/constructorstandings/1/constructors.json"
..$ limit : chr "30"
..$ offset : chr "0"
..$ total : chr "17"
..$ ConstructorTable:List of 2
.. ..$ constructorStandings: chr "1"
.. ..$ Constructors :List of 17
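The constructor list used below can be extracted by drilling into the nested structure shown above (a sketch; pluck() is from purrr, loaded with the tidyverse):

```r
# Follow the path MRData -> ConstructorTable -> Constructors
# to get the list of 17 constructors
f1_constructor_list <- f1_response |>
  content() |>
  purrr::pluck("MRData", "ConstructorTable", "Constructors")
```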
f1_constructor_tbl <- f1_constructor_list |>
as_tibble_col(column_name = "info") |> # convert list to tibble
unnest_wider(info) # unnest a list-column into columns
f1_constructor_tbl
# A tibble: 17 × 4
constructorId url name nationality
<chr> <chr> <chr> <chr>
1 benetton http://en.wikipedia.org/wiki/Benetton_Formula Bene… Italian
2 brabham-repco http://en.wikipedia.org/wiki/Brabham Brab… British
3 brawn http://en.wikipedia.org/wiki/Brawn_GP Brawn British
4 brm http://en.wikipedia.org/wiki/BRM BRM British
5 cooper-climax http://en.wikipedia.org/wiki/Cooper_Car_Comp… Coop… British
6 ferrari http://en.wikipedia.org/wiki/Scuderia_Ferrari Ferr… Italian
7 lotus-climax http://en.wikipedia.org/wiki/Team_Lotus Lotu… British
8 lotus-ford http://en.wikipedia.org/wiki/Team_Lotus Lotu… British
9 matra-ford http://en.wikipedia.org/wiki/Matra Matr… French
10 mclaren http://en.wikipedia.org/wiki/McLaren McLa… British
11 mercedes http://en.wikipedia.org/wiki/Mercedes-Benz_i… Merc… German
12 red_bull http://en.wikipedia.org/wiki/Red_Bull_Racing Red … Austrian
13 renault http://en.wikipedia.org/wiki/Renault_in_Form… Rena… French
14 team_lotus http://en.wikipedia.org/wiki/Team_Lotus Team… British
15 tyrrell http://en.wikipedia.org/wiki/Tyrrell_Racing Tyrr… British
16 vanwall http://en.wikipedia.org/wiki/Vanwall Vanw… British
17 williams http://en.wikipedia.org/wiki/Williams_Grand_… Will… British
Great article on Ethics in Web Scraping, featuring a “web scraping manifesto”
Web scraping case study from the Data science ethics chapter of MDSR
Good practice chapter from Web Scraping using R
Scraping ethics and legalities section from Web scraping chapter of R4DS (2e)
Common points
Be mindful of the terms of use of every website
Anonymize personal data, especially if data/analysis are to be publicly released
Take advantage of APIs
Only scrape what you need
The polite package ensures that you’re respecting the robots.txt and not submitting too many requests
bow() introduces the user to the host and asks for scraping permission
scrape() scrapes and retrieves data
(Sometimes, nod() is required as an intermediate step, to agree on a modification of the session path with the host)
Example: Wimbledon Women’s singles (same as before)
First, pass the URL into bow() to get a “session” object
bow() checks the robots.txt and tells us whether the webpage is scrapable
library(polite)
wimbledon_url <- "https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles"
session <- wimbledon_url |>
bow()
session
<polite session> https://en.wikipedia.org/wiki/2009_Wimbledon_Championships_-_Women's_singles
User-agent: polite R package
robots.txt: 456 rules are defined for 33 bots
Crawl delay: 5 sec
The path is scrapable for this user-agent
Now, use scrape() to get the data from the session previously created by bow()
This returns an HTML document, just like read_html() seen earlier
{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...
session |>
scrape() |>
html_element("#mw-content-text > div.mw-parser-output > div:nth-child(13)") |>
html_text2() |>
str_split_1("\\n") [1] "01. Dinara Safina (semifinals)"
[2] "02. Serena Williams (champion)"
[3] "03. Venus Williams (final)"
[4] "04. Elena Dementieva (semifinals)"
[5] "05. Svetlana Kuznetsova (third round)"
[6] "06. Jelena Janković (third round)"
[7] "07. Vera Zvonareva (third round, withdrew due to an ankle injury)"
[8] "08. Victoria Azarenka (quarterfinals)"
[9] "09. Caroline Wozniacki (fourth round)"
[10] "10. Nadia Petrova (fourth round)"
[11] "11. Agnieszka Radwańska (quarterfinals)"
[12] "12. Marion Bartoli (third round)"
[13] "13. Ana Ivanovic (fourth round, retired due to a thigh injury)"
[14] "14. Dominika Cibulková (third round)"
[15] "15. Flavia Pennetta (third round)"
[16] "16. Zheng Jie (second round)"
[17] "17. Amélie Mauresmo (fourth round)"
[18] "18. Samantha Stosur (third round)"
[19] "19. Li Na (third round)"
[20] "20. Anabel Medina Garrigues (third round)"
[21] "21. Patty Schnyder (first round)"
[22] "22. Alizé Cornet (first round)"
[23] "23. Aleksandra Wozniak (first round)"
[24] "24. Maria Sharapova (second round)"
[25] "25. Kaia Kanepi (first round)"
[26] "26. Virginie Razzano (fourth round)"
[27] "27. Alisa Kleybanova (second round)"
[28] "28. Sorana Cîrstea (third round)"
[29] "29. Sybille Bammer (first round)"
[30] "30. Ágnes Szávay (first round)"
[31] "31. Anastasia Pavlyuchenkova (second round)"
[32] "32. Anna Chakvetadze (first round)"
polite package page (more examples, featuring a template for package developers)
Browse through source code of different R “scraper” packages
Web scraping is an excellent means for gaining proficiency in data cleaning
It takes time - the more you play around the better you get
Inspect the output at each step
Consult the help documentation
Come up with fun personal projects (data viz, Shiny app, etc.), scrape data, and enjoy learning (and the struggle)
You can develop the next great sports “scrapR” package(s) (or even for your field of interest)