Scrapy doesn't keep cookies for some pages
I am trying to parse the articles from this site: https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t
When you visit the website for the first time, you are prompted to accept cookies. The site seems to store the consent in a cookie named DSGVO_ZUSAGE_V1 with the value true, because scraping works when I set that cookie myself:
def start_requests(self):
    urls = [
        'https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t'
    ]
    for url in urls:
        # Send the consent cookie along with the very first request.
        yield scrapy.Request(url=url, callback=self.parse,
                             cookies={'DSGVO_ZUSAGE_V1': 'true'})

def parse(self, response):
    base_query = '//div[contains(@class,"contentLeft")]/ul[contains(@class, "stories")]/li'
    articles = response.xpath(base_query)
    for index, value in enumerate(articles):
        # validateResponse is a custom helper that extracts the match
        # of the XPath query (at the given index) from the response.
        article_url = validateResponse(
            response,
            base_query + '/div[contains(@class,"text")]/h3/a/@href',
            index)
        request = scrapy.Request(response.urljoin(article_url),
                                 callback=self.parseSingleArticle,
                                 cookies={'DSGVO_ZUSAGE_V1': 'true'})
        yield request

def parseSingleArticle(self, response):
    article_content = ''
    article_date = validateResponse(response, "//h6[contains(@class,'info')]/span[contains(@class,'date')]/text()")
    article_title = validateResponse(response, "//h1[contains(@itemprop,'headline')]/text()")
    query = "//div[contains(@class,'copytext')]//child::text()"
    article_content_response = response.xpath(query)
    for index, value in enumerate(article_content_response):
        article_content += " " + validateResponse(response, query, index)
    yield self.article_to_pipeline(article_content, response.url, article_date, article_title)

def article_to_pipeline(self, article_content, url, article_date, article_title):
    article_item = ArticleItem()
    # some other stuff
    return article_item
For some articles this works perfectly. From the debug log:
2018-11-24 10:01:58 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000091420120/60-Stunden-pro-Woche-Fuer-Unternehmer-ist-das-die-Realitaet>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38
I get everything I want from that article.
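(Those "Sending cookies" lines come from Scrapy's cookies middleware; they only show up because I have cookie debugging switched on in settings.py:)

# settings.py -- make the cookies middleware log every Cookie header
# it sends and every Set-Cookie header it receives
COOKIES_DEBUG = True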
But for some articles it doesn't work. For example, this request returns nothing:
2018-11-24 10:02:07 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000090136491/Lufthunderter-fuer-E-Autos-faellt>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38
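The Cookie header is identical in both cases, yet the second page yields nothing. As a debugging sketch (open_in_browser is Scrapy's own helper; the rest is just illustrative), I can check what the failing pages actually contain at the top of parseSingleArticle:

# debugging only, at the top of parseSingleArticle
from scrapy.utils.response import open_in_browser

self.logger.debug('%s -> status %s, %d bytes',
                  response.url, response.status, len(response.body))
if not response.xpath("//div[contains(@class,'copytext')]"):
    # no article body found: show what Scrapy actually received,
    # e.g. whether it is the consent page instead of the article
    open_in_browser(response)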
The same happens in Chrome: when I open this article I get the accept-cookies prompt again, despite having already accepted it on another page of their site. When I accept, the confirmation is again stored under DSGVO_ZUSAGE_V1=true and MGUID (deleting those cookies in Chrome brings the prompt back).

Does anybody have any ideas? I have tried other cookies, but MGUID and DSGVO_ZUSAGE_V1 are the only ones that make any difference.
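For completeness: to rule out a cookie ever being missing from a request, one option would be a tiny downloader middleware that forces the consent cookie onto every outgoing request (this is only a sketch; the class and module names are mine):

# middlewares.py -- sketch: attach the consent cookie to every request
class ForceConsentCookieMiddleware:
    def process_request(self, request, spider):
        # request.cookies is the per-request cookie dict that Scrapy
        # merges into its cookie jar (assumes cookies were given as a
        # dict, which is how my spider passes them)
        if isinstance(request.cookies, dict):
            request.cookies.setdefault('DSGVO_ZUSAGE_V1', 'true')

# settings.py -- run it before the built-in CookiesMiddleware (700)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ForceConsentCookieMiddleware': 400,
}

That would guarantee every request carries the cookie, but it would not explain why some article pages behave differently in the first place.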
python cookies web-scraping scrapy scrapy-spider
asked Nov 24 '18 at 9:15 by drowyek