Scrapy doesn't keep cookies for some pages
I am trying to parse the articles from this site: https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t



When you visit the website for the first time, you are prompted to accept cookies. It seems that they store the consent in the cookie DSGVO_ZUSAGE_V1=true, because when I scrape like this it works:



def start_requests(self):
    urls = [
        'https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t'
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse,
                             cookies={'DSGVO_ZUSAGE_V1': 'true'})

def parse(self, response):
    base_query = '//div[contains(@class,"contentLeft")]/ul[contains(@class, "stories")]/li'
    articles = response.xpath(base_query)
    for index, value in enumerate(articles):
        article_url = validateResponse(response,
                                       base_query + '/div[contains(@class,"text")]/h3/a/@href',
                                       index)
        request = scrapy.Request(response.urljoin(article_url),
                                 callback=self.parseSingleArticle,
                                 cookies={'DSGVO_ZUSAGE_V1': 'true'})
        yield request

def parseSingleArticle(self, response):
    article_content = ''
    article_date = validateResponse(response, "//h6[contains(@class,'info')]/span[contains(@class,'date')]/text()")
    article_title = validateResponse(response, "//h1[contains(@itemprop,'headline')]/text()")
    query = "//div[contains(@class,'copytext')]//child::text()"
    article_content_response = response.xpath(query)
    for index, value in enumerate(article_content_response):
        article_content += " " + validateResponse(response, query, index)
    yield self.article_to_pipeline(article_content, response.url, article_date, article_title)

def article_to_pipeline(self, article_content, url, article_date, article_title):
    article_item = ArticleItem()
    # some other stuff
    return article_item


For some articles this works perfectly. From the debug log:



2018-11-24 10:01:58 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000091420120/60-Stunden-pro-Woche-Fuer-Unternehmer-ist-das-die-Realitaet>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38


And I get everything I wanted for this article.
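(Side note for anyone reproducing this: the scrapy.downloadermiddlewares.cookies lines above only appear when cookie debugging is turned on, e.g. in settings.py:)

```python
# settings.py (sketch): make Scrapy's CookiesMiddleware log every
# Cookie header it sends and every Set-Cookie header it receives,
# which is where the DEBUG lines above come from.
COOKIES_DEBUG = True
COOKIES_ENABLED = True  # default; the middleware must be on for these logs
```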



But then there are some articles where it doesn't work. For example, this request doesn't return anything:



2018-11-24 10:02:07 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000090136491/Lufthunderter-fuer-E-Autos-faellt>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38


Also, when I open this article in Chrome I get the "accept cookies" prompt again, despite having accepted it before on another page of their site. When I accept it, the confirmation is again saved under DSGVO_ZUSAGE_V1:true and MGUID (deleting them from Chrome brings the prompt back).
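As an aside, the MGUID value in the log lines above is itself an &-separated key=value string, so it can be inspected with the standard library (values copied verbatim from the log; note that parse_qs drops the empty Version field by default):

```python
from urllib.parse import parse_qs

# MGUID value exactly as it appears in the Cookie header above
mguid = ("GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d"
         "&Timestamp=2018-11-24T09:01:52"
         "&DetectedVersion=Web&Version="
         "&Hash=4F3A498256857F21390A3DD636588B38")

fields = parse_qs(mguid)        # {'GUID': [...], 'Timestamp': [...], ...}
print(fields["GUID"][0])        # e2b4bb79-069f-4778-8062-f34d7b3d2b9d
print(fields["Timestamp"][0])   # 2018-11-24T09:01:52
```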



Does anybody have any ideas? I have tried other cookies, but MGUID and DSGVO_ZUSAGE_V1 are the only ones that make a difference.
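One thing worth trying to rule out the CookiesMiddleware entirely (an untested sketch, not something I've confirmed against this site): turn the middleware off and send the consent cookie as a raw header on every request instead:

```python
# settings.py (sketch): with COOKIES_ENABLED off, Scrapy no longer
# manages cookies, so a raw Cookie header set here goes out
# unchanged on every request.
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    "Cookie": "DSGVO_ZUSAGE_V1=true",
}
```

The trade-off is that Set-Cookie responses (e.g. MGUID) are then ignored, so if the site actually requires MGUID this would narrow the problem down rather than fix it.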
Tags: python, cookies, web-scraping, scrapy, scrapy-spider
      asked Nov 24 '18 at 9:15
drowyek