Looping through alphabetical pages (rvest)
After spending a lot of time on this problem and looking through the available answers, I wanted to ask a new question about a web-scraping problem I have with R and rvest. I have tried to lay the problem out fully to minimize follow-up questions.
The Problem
I am trying to extract the author names from a conference webpage. The authors are separated alphabetically by last name, so I need a for loop that calls follow_link() once per letter to visit each page and extract the pertinent author text.
The conference website:
https://gsa.confex.com/gsa/2016AM/webprogram/authora.html
I have attempted two solutions in R using rvest, both with problems.
Solution 1 (Letter call to link)
library(rvest)

lttrs <- LETTERS                  # character vector of capital letters "A" through "Z"
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")
tempList <- list()                # list to store each page's author information

for (i in seq_along(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>%                        # use capital letters to call links to author pages
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}
This code works, up to a point. Below is the output: it navigates the lettered pages successfully until the H-to-I and the L-to-M transitions, at which point it grabs the wrong page.
Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home
Solution 2 (CSS call to link)
Using a CSS selector on the page, each lettered page link is identified as a:nth-child(1) through a:nth-child(26), so I reconstructed my loop around that CSS identifier.
tempList <- list()

for (i in 2:length(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(css = paste0('a:nth-child(', i, ')')) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}
This works, kind of. Again, it has trouble with certain transitions (see below):
Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html
Specifically, this method misses B, C, and D, looping to the incorrect pages at that step. I would greatly appreciate any insights or directions on how my code above could be reconfigured to loop correctly through all 26 alphabetical pages.
Thank you so much!
css r web-scraping rvest
asked Nov 25 '18 at 14:37 by morrismc
1 Answer
Welcome to SO (and kudos on a 👍🏼 first question).
You seem to have gotten super lucky, as the robots.txt for that site has a ton of entries but doesn't try to restrict what you're doing.
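If you want to check that yourself, the robotstxt package can test a path against the site's rules. A minimal sketch; the package is my suggestion and not something this answer relies on:

# install.packages("robotstxt")   # not used elsewhere in this answer
library(robotstxt)

# TRUE means the path is not disallowed for the default ("*") user agent
paths_allowed(
  paths  = "/gsa/2016AM/webprogram/authora.html",
  domain = "gsa.confex.com"
)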
We can pull all of the hrefs in the alphabet pagination links at the bottom of the page with html_nodes(pg, "a[href^='author']"). The code below grabs all the paper links from all the authors:
library(rvest)
library(tidyverse)
pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")
html_nodes(pg, "a[href^='author']") %>%
html_attr("href") %>%
sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
{ pb <<- progress_estimated(length(.)) ; . } %>% # we'll use a progress bar as this will take ~3m
map_df(~{
pb$tick()$print() # increment progress bar
Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay
read_html(.x) %>%
html_nodes("div.item > div.author") %>%
map_df(~{
data_frame(
author = html_text(.x, trim = TRUE),
paper = html_nodes(.x, xpath="../div[@class='papers']/a") %>%
html_text(trim = TRUE),
paper_url = html_nodes(.x, xpath="../div[@class='papers']/a") %>%
html_attr("href") %>%
sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .)
)
})
}) -> author_papers
author_papers
## # A tibble: 34,983 x 3
## author paper paper_url
## <chr> <chr> <chr>
## 1 Aadahl, Kristopher 296-5 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html
## 2 Aanderud, Zachary T. 215-11 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286442.html
## 3 Abbey, Alyssa 54-4 https://gsa.confex.com/gsa/2016AM/webprogram/Paper281801.html
## 4 Abbott, Dallas H. 341-34 https://gsa.confex.com/gsa/2016AM/webprogram/Paper287404.html
## 5 Abbott Jr., David M. 38-6 https://gsa.confex.com/gsa/2016AM/webprogram/Paper278060.html
## 6 Abbott, Grant 58-7 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283414.html
## 7 Abbott, Jared 29-10 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286237.html
## 8 Abbott, Jared 317-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper282386.html
## 9 Abbott, Kathryn A. 187-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286127.html
## 10 Abbott, Lon D. 208-16 https://gsa.confex.com/gsa/2016AM/webprogram/Paper280093.html
## # ... with 34,973 more rows
I don't know what you need from the individual paper pages, so I'll leave that part to you.
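If you do end up needing something from a paper page later, here is a minimal sketch of visiting one of them (the URL is row 1 of author_papers above); I'm only assuming the page exposes a standard <title> element, since I haven't inspected its structure:

library(rvest)

paper_pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html")

# swap "title" for whatever selector matches the fields you actually need
html_node(paper_pg, "title") %>%
  html_text(trim = TRUE)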
You also don't have to wait ~3 minutes, since the author_papers data frame is in this RDS file, https://rud.is/dl/author-papers.rds, which you can read with:
readRDS(url("https://rud.is/dl/author-papers.rds"))
If you do plan on scraping all 34,983 papers, then please continue to heed "don't be rude" and use a crawl delay (ref: https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/).
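One way to keep that delay in a single place is to wrap read_html in a rate-limited version; purrr::slowly() is my suggestion here, not something the code above uses:

library(purrr)
library(rvest)

# read_html, but never called more often than once every 5 seconds
polite_read_html <- slowly(read_html, rate = rate_delay(pause = 5))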
UPDATE
html_nodes(pg, "a[href^='author']") %>%
html_attr("href") %>%
sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
{ pb <<- progress_estimated(length(.)) ; . } %>% # we'll use a progress bar as this will take ~3m
map_df(~{
pb$tick()$print() # increment progress bar
Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay
read_html(.x) %>%
html_nodes("div.item > div.author") %>%
map_df(~{
data_frame(
author = html_text(.x, trim = TRUE),
is_presenting = html_nodes(.x, xpath="../div[@class='papers']") %>%
html_text(trim = TRUE) %>% # retrieve the text of all the "papers"
paste0(collapse=" ") %>% # just in case there are multiple nodes we flatten them into one
grepl("*", ., fixed=TRUE) # make it TRUE if we find the "*"
)
})
}) -> author_with_presenter_status
author_with_presenter_status
## # A tibble: 22,545 x 2
## author is_presenting
## <chr> <lgl>
## 1 Aadahl, Kristopher FALSE
## 2 Aanderud, Zachary T. FALSE
## 3 Abbey, Alyssa TRUE
## 4 Abbott, Dallas H. FALSE
## 5 Abbott Jr., David M. TRUE
## 6 Abbott, Grant FALSE
## 7 Abbott, Jared FALSE
## 8 Abbott, Kathryn A. FALSE
## 9 Abbott, Lon D. FALSE
## 10 Abbott, Mark B. FALSE
## # ... with 22,535 more rows
Which you can also retrieve with:
readRDS(url("https://rud.is/dl/author-presenter.rds"))
answered Nov 25 '18 at 15:00 by hrbrmstr (edited Nov 25 '18 at 16:07)
Thanks for this! It may be doing a bit more than I need; all I need to do is collect the author names from each page of the 2016 conference. However, I was able to clean up the data frame with unique(), since authors listed on multiple abstracts appear multiple times.
– morrismc
Nov 25 '18 at 15:45
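For the deduplication mentioned in the comment above, a minimal sketch against the author_papers tibble from the answer; dplyr::distinct() is simply an alternative to unique():

library(dplyr)

# one row per unique author name, dropping the paper columns
unique_authors <- distinct(author_papers, author)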
Ah, one more question: if I wanted to note whether someone was the presenting author, denoted with an * on the page, would there be a simple solution to create a new TRUE/FALSE column for presenting status?
– morrismc
Nov 25 '18 at 15:51
Updated to just get the author and whether they were a presenter.
– hrbrmstr
Nov 25 '18 at 16:07
Woohoo! Thanks so much! I really appreciate it :)
– morrismc
Nov 25 '18 at 16:26