Looping through alphabetical pages (rvest)












After spending a lot of time on this problem and looking through the available answers, I wanted to ask a new question about a web-scraping problem I have with R and rvest. I have tried to lay the problem out fully to minimize follow-up questions.



The Problem
I am trying to extract the author names from a conference webpage. The authors are grouped alphabetically by last name, so I need a for loop that calls follow_link() once per lettered page to visit each page and extract the pertinent author text.



The conference website:
https://gsa.confex.com/gsa/2016AM/webprogram/authora.html



I have attempted two solutions in R using rvest, both with problems.



Solution 1 (Letter call to link)



library(rvest)

lttrs <- LETTERS                # character vector "A" through "Z"
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

tempList <- list()              # list to store each page's author information

for (i in seq_along(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>%   # use the capital letter to follow the link to that author page
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}


This code works, up to a point. Below is the output: it navigates the lettered pages correctly until the H-to-I transition and again at the L-to-M transition, at which point it grabs the wrong page.



Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home
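
It looks as though follow_link() with a character string matches the first link whose text merely contains that string, so a bare "I" or "M" can latch onto an unrelated link elsewhere on the page, which would explain the jumps above. Below is a minimal, untested sketch that instead targets each letter's link by its href, reusing the css = argument from Solution 2 and assuming every lettered page is linked with a plain author<letter>.html href:

tempList <- list()

for (i in seq_along(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(css = sprintf("a[href='author%s.html']", tolower(lttrs[i]))) %>% # match the link by href, not by text
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}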


Solution 2 (CSS call to link)
Using a CSS selector on the page, each lettered page's link is identified as "a:nth-child(n)" for n from 1 to 26. So I reconstructed my loop around that CSS selector.



tempList <- list()

for (i in 2:length(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(css = paste0("a:nth-child(", i, ")")) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}


This sort of works, but again it has trouble with certain transitions (see below).



Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html


Specifically, this method misses B, C, and D, looping to the incorrect pages at those steps. I would greatly appreciate any insight or direction on how the code above could be reconfigured to loop through all 26 alphabetical pages correctly.
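
For reference, here is one minimal, untested sketch of such a reconfiguration. It assumes, based on the navigation log above, that every page follows the author<letter>.html pattern, so the 26 URLs can be built directly and read with read_html(), avoiding follow_link() and its link matching altogether:

library(rvest)

base_url <- "https://gsa.confex.com/gsa/2016AM/webprogram/author%s.html"

author_names <- lapply(letters, function(l) {
  Sys.sleep(2)                                   # small delay so the server is not hammered
  read_html(sprintf(base_url, l)) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
})
names(author_names) <- LETTERS                   # one character vector of author names per letter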



Thank you so much!










Tags: css, r, web-scraping, rvest






asked Nov 25 '18 at 14:37 by morrismc









          1 Answer














          Welcome to SO (and kudos on a 👍🏼 first question).



          You seem to have gotten super lucky as the robots.txt for that site has a ton of entries but doesn't try to restrict what you're doing.



We can pull all of the hrefs in the alphabet pagination links at the bottom of the page with html_nodes(pg, "a[href^='author']"). The code below grabs all the paper links for all of the authors:



library(rvest)
library(tidyverse)

pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

html_nodes(pg, "a[href^='author']") %>%
  html_attr("href") %>%
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
  { pb <<- progress_estimated(length(.)) ; . } %>% # we'll use a progress bar as this will take ~3m
  map_df(~{

    pb$tick()$print() # increment progress bar

    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

    read_html(.x) %>%
      html_nodes("div.item > div.author") %>%
      map_df(~{
        data_frame(
          author    = html_text(.x, trim = TRUE),
          paper     = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_text(trim = TRUE),
          paper_url = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_attr("href") %>%
            sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .)
        )
      })
  }) -> author_papers

          author_papers
          ## # A tibble: 34,983 x 3
          ## author paper paper_url
          ## <chr> <chr> <chr>
          ## 1 Aadahl, Kristopher 296-5 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html
          ## 2 Aanderud, Zachary T. 215-11 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286442.html
          ## 3 Abbey, Alyssa 54-4 https://gsa.confex.com/gsa/2016AM/webprogram/Paper281801.html
          ## 4 Abbott, Dallas H. 341-34 https://gsa.confex.com/gsa/2016AM/webprogram/Paper287404.html
          ## 5 Abbott Jr., David M. 38-6 https://gsa.confex.com/gsa/2016AM/webprogram/Paper278060.html
          ## 6 Abbott, Grant 58-7 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283414.html
          ## 7 Abbott, Jared 29-10 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286237.html
          ## 8 Abbott, Jared 317-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper282386.html
          ## 9 Abbott, Kathryn A. 187-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286127.html
          ## 10 Abbott, Lon D. 208-16 https://gsa.confex.com/gsa/2016AM/webprogram/Paper280093.html
          ## # ... with 34,973 more rows


I don't know what you need from the individual paper pages, so I'll leave that part to you.



You also don't have to wait ~3m, since the author_papers data frame is in this RDS file: https://rud.is/dl/author-papers.rds, which you can read with:



          readRDS(url("https://rud.is/dl/author-papers.rds"))


If you do plan on scraping the 34,983 papers, then please continue to heed "don't be rude" and use a crawl delay (ref: https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/).



          UPDATE



          html_nodes(pg, "a[href^='author']") %>% 
          html_attr("href") %>%
          sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
          { pb <<- progress_estimated(length(.)) ; . } %>% # we'll use a progress bar as this will take ~3m
          map_df(~{

          pb$tick()$print() # increment progress bar

          Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

          read_html(.x) %>%
          html_nodes("div.item > div.author") %>%
          map_df(~{
          data_frame(
          author = html_text(.x, trim = TRUE),
          is_presenting = html_nodes(.x, xpath="../div[@class='papers']") %>%
          html_text(trim = TRUE) %>% # retrieve the text of all the "papers"
          paste0(collapse=" ") %>% # just in case there are multiple nodes we flatten them into one
          grepl("*", ., fixed=TRUE) # make it TRUE if we find the "*"
          )
          })
          }) -> author_with_presenter_status

          author_with_presenter_status
          ## # A tibble: 22,545 x 2
          ## author is_presenting
          ## <chr> <lgl>
          ## 1 Aadahl, Kristopher FALSE
          ## 2 Aanderud, Zachary T. FALSE
          ## 3 Abbey, Alyssa TRUE
          ## 4 Abbott, Dallas H. FALSE
          ## 5 Abbott Jr., David M. TRUE
          ## 6 Abbott, Grant FALSE
          ## 7 Abbott, Jared FALSE
          ## 8 Abbott, Kathryn A. FALSE
          ## 9 Abbott, Lon D. FALSE
          ## 10 Abbott, Mark B. FALSE
          ## # ... with 22,535 more rows


          Which you can also retrieve with:



          readRDS(url("https://rud.is/dl/author-presenter.rds"))





answered Nov 25 '18 at 15:00 by hrbrmstr (edited Nov 25 '18 at 16:07)


























• Thanks for this! It may be doing a bit more than I need: all I need is to collect the author names from each page of the 2016 conference. However, I was able to clean up the data frame with unique(), since authors who are listed on multiple abstracts appear multiple times.

            – morrismc
            Nov 25 '18 at 15:45
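
The de-duplication mentioned in the comment above could look something like this (a sketch, assuming the author_papers tibble from the answer is in the workspace and dplyr is attached via library(tidyverse)):

unique_authors <- author_papers %>%
  distinct(author)      # one row per author name, dropping repeats from multiple abstracts

nrow(unique_authors)    # count of distinct author names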













• Ah, one more question: if I wanted to note whether or not someone was the presenting author, denoted with an * on the page, would there be a simple solution to create a new TRUE/FALSE column for presenting status?

            – morrismc
            Nov 25 '18 at 15:51











          • Updated to just get the author and whether they were a presenter.

            – hrbrmstr
            Nov 25 '18 at 16:07











          • Woohoo Thanks so much! I really appreciate it :)

            – morrismc
            Nov 25 '18 at 16:26










