Scrapy doesn't keep cookies for some pages
I am trying to parse the articles from this site: https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t
When you visit the website for the first time, you are prompted to accept cookies. The site seems to store the consent in a cookie named DSGVO_ZUSAGE_V1 with the value true, because scraping works when I set that cookie myself:
def start_requests(self):
    urls = [
        'https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t'
    ]
    for url in urls:
        # Send the consent cookie along with the very first request.
        yield scrapy.Request(url=url, callback=self.parse,
                             cookies={'DSGVO_ZUSAGE_V1': 'true'})

def parse(self, response):
    base_query = '//div[contains(@class,"contentLeft")]/ul[contains(@class, "stories")]/li'
    articles = response.xpath(base_query)
    for index, value in enumerate(articles):
        # validateResponse is a custom helper that extracts the match
        # of the XPath query (at the given index) from the response.
        article_url = validateResponse(
            response,
            base_query + '/div[contains(@class,"text")]/h3/a/@href',
            index)
        request = scrapy.Request(response.urljoin(article_url),
                                 callback=self.parseSingleArticle,
                                 cookies={'DSGVO_ZUSAGE_V1': 'true'})
        yield request

def parseSingleArticle(self, response):
    article_content = ''
    article_date = validateResponse(response, "//h6[contains(@class,'info')]/span[contains(@class,'date')]/text()")
    article_title = validateResponse(response, "//h1[contains(@itemprop,'headline')]/text()")
    query = "//div[contains(@class,'copytext')]//child::text()"
    article_content_response = response.xpath(query)
    for index, value in enumerate(article_content_response):
        article_content += " " + validateResponse(response, query, index)
    yield self.article_to_pipeline(article_content, response.url, article_date, article_title)

def article_to_pipeline(self, article_content, url, article_date, article_title):
    article_item = ArticleItem()
    # some other stuff
    return article_item
For some articles this works perfectly. From the debug log:
2018-11-24 10:01:58 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000091420120/60-Stunden-pro-Woche-Fuer-Unternehmer-ist-das-die-Realitaet>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38
I get everything I want from that article.
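(Those "Sending cookies" lines come from Scrapy's cookies middleware; they only show up because I have cookie debugging switched on in settings.py:)

# settings.py -- make the cookies middleware log every Cookie header
# it sends and every Set-Cookie header it receives
COOKIES_DEBUG = True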
But for some articles it doesn't work. For example, this request returns nothing:
2018-11-24 10:02:07 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000090136491/Lufthunderter-fuer-E-Autos-faellt>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38
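The Cookie header is identical in both cases, yet the second page yields nothing. As a debugging sketch (open_in_browser is Scrapy's own helper; the rest is just illustrative), I can check what the failing pages actually contain at the top of parseSingleArticle:

# debugging only, at the top of parseSingleArticle
from scrapy.utils.response import open_in_browser

self.logger.debug('%s -> status %s, %d bytes',
                  response.url, response.status, len(response.body))
if not response.xpath("//div[contains(@class,'copytext')]"):
    # no article body found: show what Scrapy actually received,
    # e.g. whether it is the consent page instead of the article
    open_in_browser(response)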
The same happens in Chrome: when I open this article I get the accept-cookies prompt again, despite having already accepted it on another page of their site. When I accept, the confirmation is again stored under DSGVO_ZUSAGE_V1=true and MGUID (deleting those cookies in Chrome brings the prompt back).

Does anybody have any ideas? I have tried other cookies, but MGUID and DSGVO_ZUSAGE_V1 are the only ones that make any difference.
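For completeness: to rule out a cookie ever being missing from a request, one option would be a tiny downloader middleware that forces the consent cookie onto every outgoing request (this is only a sketch; the class and module names are mine):

# middlewares.py -- sketch: attach the consent cookie to every request
class ForceConsentCookieMiddleware:
    def process_request(self, request, spider):
        # request.cookies is the per-request cookie dict that Scrapy
        # merges into its cookie jar (assumes cookies were given as a
        # dict, which is how my spider passes them)
        if isinstance(request.cookies, dict):
            request.cookies.setdefault('DSGVO_ZUSAGE_V1', 'true')

# settings.py -- run it before the built-in CookiesMiddleware (700)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ForceConsentCookieMiddleware': 400,
}

That would guarantee every request carries the cookie, but it would not explain why some article pages behave differently in the first place.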
python cookies web-scraping scrapy scrapy-spider
asked Nov 24 '18 at 9:15 by drowyek