Looping through alphabetical pages (rvest)












After spending a lot of time on this problem and looking through the available answers, I wanted to ask a new question about a web-scraping problem I have with R and rvest. I have tried to lay the problem out fully to minimize follow-up questions.



The Problem
I am trying to extract the author names from a conference webpage. The authors are grouped alphabetically by last name, so I need a for loop that calls follow_link() once per lettered page to visit each page and extract the pertinent author text.



The conference website:
https://gsa.confex.com/gsa/2016AM/webprogram/authora.html



I have attempted two solutions in R using rvest, both with problems.



Solution 1 (Letter call to link)



library(rvest)

lttrs <- LETTERS                # character vector "A" through "Z"
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

tempList <- list()              # list to store each page's author information

for (i in seq_along(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>%   # use the capital letter to follow the link to that author page
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}


This code works, up to a point. Below is the output: it navigates the lettered pages correctly until the H-to-I transition and again at the L-to-M transition, at which point it grabs the wrong page.



Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home
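
It looks as though follow_link() with a character string matches the first link whose text merely contains that string, so a bare "I" or "M" can latch onto an unrelated link elsewhere on the page, which would explain the jumps above. Below is a minimal, untested sketch that instead targets each letter's link by its href, reusing the css = argument from Solution 2 and assuming every lettered page is linked with a plain author<letter>.html href:

tempList <- list()

for (i in seq_along(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(css = sprintf("a[href='author%s.html']", tolower(lttrs[i]))) %>% # match the link by href, not by text
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}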


Solution 2 (CSS call to link)
Using a CSS selector on the page, each lettered page's link is identified as "a:nth-child(n)" for n from 1 to 26. So I reconstructed my loop around that CSS selector.



tempList <- list()

for (i in 2:length(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(css = paste0("a:nth-child(", i, ")")) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}


This sort of works, but again it has trouble with certain transitions (see below).



Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html


Specifically, this method misses B, C, and D, looping to the incorrect pages at those steps. I would greatly appreciate any insight or direction on how the code above could be reconfigured to loop through all 26 alphabetical pages correctly.
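
For reference, here is one minimal, untested sketch of such a reconfiguration. It assumes, based on the navigation log above, that every page follows the author<letter>.html pattern, so the 26 URLs can be built directly and read with read_html(), avoiding follow_link() and its link matching altogether:

library(rvest)

base_url <- "https://gsa.confex.com/gsa/2016AM/webprogram/author%s.html"

author_names <- lapply(letters, function(l) {
  Sys.sleep(2)                                   # small delay so the server is not hammered
  read_html(sprintf(base_url, l)) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
})
names(author_names) <- LETTERS                   # one character vector of author names per letter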



Thank you so much!










Tags: css, r, web-scraping, rvest






asked Nov 25 '18 at 14:37 by morrismc









          1 Answer














          Welcome to SO (and kudos on a 👍🏼 first question).



          You seem to have gotten super lucky as the robots.txt for that site has a ton of entries but doesn't try to restrict what you're doing.



We can pull all of the hrefs in the alphabet pagination links at the bottom of the page with html_nodes(pg, "a[href^='author']"). The code below grabs all the paper links for all of the authors:



library(rvest)
library(tidyverse)

pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

html_nodes(pg, "a[href^='author']") %>%
  html_attr("href") %>%
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
  { pb <<- progress_estimated(length(.)) ; . } %>% # we'll use a progress bar as this will take ~3m
  map_df(~{

    pb$tick()$print() # increment progress bar

    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

    read_html(.x) %>%
      html_nodes("div.item > div.author") %>%
      map_df(~{
        data_frame(
          author    = html_text(.x, trim = TRUE),
          paper     = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_text(trim = TRUE),
          paper_url = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_attr("href") %>%
            sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .)
        )
      })
  }) -> author_papers

          author_papers
          ## # A tibble: 34,983 x 3
          ## author paper paper_url
          ## <chr> <chr> <chr>
          ## 1 Aadahl, Kristopher 296-5 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html
          ## 2 Aanderud, Zachary T. 215-11 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286442.html
          ## 3 Abbey, Alyssa 54-4 https://gsa.confex.com/gsa/2016AM/webprogram/Paper281801.html
          ## 4 Abbott, Dallas H. 341-34 https://gsa.confex.com/gsa/2016AM/webprogram/Paper287404.html
          ## 5 Abbott Jr., David M. 38-6 https://gsa.confex.com/gsa/2016AM/webprogram/Paper278060.html
          ## 6 Abbott, Grant 58-7 https://gsa.confex.com/gsa/2016AM/webprogram/Paper283414.html
          ## 7 Abbott, Jared 29-10 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286237.html
          ## 8 Abbott, Jared 317-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper282386.html
          ## 9 Abbott, Kathryn A. 187-9 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286127.html
          ## 10 Abbott, Lon D. 208-16 https://gsa.confex.com/gsa/2016AM/webprogram/Paper280093.html
          ## # ... with 34,973 more rows


I don't know what you need from the individual paper pages, so I'll leave that part to you.



You also don't have to wait ~3m, since the author_papers data frame is in this RDS file: https://rud.is/dl/author-papers.rds, which you can read with:



          readRDS(url("https://rud.is/dl/author-papers.rds"))


If you do plan on scraping the 34,983 papers, then please continue to heed "don't be rude" and use a crawl delay (ref: https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/).



          UPDATE



          html_nodes(pg, "a[href^='author']") %>% 
          html_attr("href") %>%
          sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
          { pb <<- progress_estimated(length(.)) ; . } %>% # we'll use a progress bar as this will take ~3m
          map_df(~{

          pb$tick()$print() # increment progress bar

          Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay

          read_html(.x) %>%
          html_nodes("div.item > div.author") %>%
          map_df(~{
          data_frame(
          author = html_text(.x, trim = TRUE),
          is_presenting = html_nodes(.x, xpath="../div[@class='papers']") %>%
          html_text(trim = TRUE) %>% # retrieve the text of all the "papers"
          paste0(collapse=" ") %>% # just in case there are multiple nodes we flatten them into one
          grepl("*", ., fixed=TRUE) # make it TRUE if we find the "*"
          )
          })
          }) -> author_with_presenter_status

          author_with_presenter_status
          ## # A tibble: 22,545 x 2
          ## author is_presenting
          ## <chr> <lgl>
          ## 1 Aadahl, Kristopher FALSE
          ## 2 Aanderud, Zachary T. FALSE
          ## 3 Abbey, Alyssa TRUE
          ## 4 Abbott, Dallas H. FALSE
          ## 5 Abbott Jr., David M. TRUE
          ## 6 Abbott, Grant FALSE
          ## 7 Abbott, Jared FALSE
          ## 8 Abbott, Kathryn A. FALSE
          ## 9 Abbott, Lon D. FALSE
          ## 10 Abbott, Mark B. FALSE
          ## # ... with 22,535 more rows


          Which you can also retrieve with:



          readRDS(url("https://rud.is/dl/author-presenter.rds"))





answered Nov 25 '18 at 15:00 by hrbrmstr (edited Nov 25 '18 at 16:07)


























• Thanks for this! It may be doing a bit more than I need: all I need is to collect the author names from each page of the 2016 conference. However, I was able to clean up the data frame with unique(), since authors who are listed on multiple abstracts appear multiple times.

            – morrismc
            Nov 25 '18 at 15:45
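
The de-duplication mentioned in the comment above could look something like this (a sketch, assuming the author_papers tibble from the answer is in the workspace and dplyr is attached via library(tidyverse)):

unique_authors <- author_papers %>%
  distinct(author)      # one row per author name, dropping repeats from multiple abstracts

nrow(unique_authors)    # count of distinct author names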













• Ah, one more question: if I wanted to note whether or not someone was the presenting author, denoted with an * on the page, would there be a simple solution to create a new TRUE/FALSE column for presenting status?

            – morrismc
            Nov 25 '18 at 15:51











          • Updated to just get the author and whether they were a presenter.

            – hrbrmstr
            Nov 25 '18 at 16:07











          • Woohoo Thanks so much! I really appreciate it :)

            – morrismc
            Nov 25 '18 at 16:26










