Too many requests error while crawling users reputation from Stack Overflow

I have a list of user IDs and I'm interested in crawling their reputation.

I wrote a script using BeautifulSoup that crawls users' reputation. The problem is that I get a "Too many requests" error when my script has been running for less than a minute, and after that I can no longer open Stack Overflow manually in a browser either.

My question is: how do I crawl the reputation without getting the "Too many requests" error?

My code is given below:

from requests import get
from bs4 import BeautifulSoup

# df is assumed to be a pandas DataFrame with user IDs in its 'target' column
for id in df['target']:
    url = 'https://stackoverflow.com/users/' + str(id)
    print(url)
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    site_title = html_soup.find("title").contents[0]
    if "Page Not Found - Stack Overflow" in site_title:
        reputation = "NA"
    else:
        reputation = html_soup.find(class_='grid--cell fs-title fc-dark').contents[0].replace(',', '')
    print(reputation)









python beautifulsoup web-crawler

asked Nov 20 at 20:08 – enjal
edited Nov 26 at 1:38 – Pang

  • Why are you doing this with a web crawler instead of using the Stack Exchange Data Explorer? – Barmar, Nov 20 at 20:16
  • Duplicate of stackoverflow.com/questions/22786068/… – Keith John Hutchison, Nov 20 at 20:16
  • @Barmar with that I won't get this error? – enjal, Nov 20 at 20:40
  • You won't be accessing the webserver at all. – Barmar, Nov 20 at 20:41
  • Can you please add the imports to the source in the question. – Keith John Hutchison, Nov 20 at 20:46
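
Regarding the Data Explorer suggestion in the comments above: another way to avoid scraping HTML entirely (my own aside, not from the original thread) is the public Stack Exchange API, whose /users endpoint takes up to 100 semicolon-separated user IDs per request and returns reputation directly as JSON. A minimal sketch, where user_ids is a hypothetical stand-in for the question's df['target']:

import requests

# Hypothetical stand-in for the question's df['target'] list of user IDs.
user_ids = [22656, 29407, 157882]

# Batch up to 100 IDs per request, joined with semicolons; the JSON response
# carries a reputation field per user, so no HTML parsing is needed.
ids = ';'.join(str(i) for i in user_ids)
response = requests.get('https://api.stackexchange.com/2.3/users/' + ids,
                        params={'site': 'stackoverflow'})
data = response.json()

for item in data.get('items', []):
    print(item['user_id'], item['reputation'])

# The API signals throttling explicitly: when a 'backoff' field is present,
# wait that many seconds before making the next request.
if 'backoff' in data:
    print('Back off for %s seconds' % data['backoff'])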

2 Answers

You can check whether response.status_code == 429 and see if there is a value in the response telling you how long to wait, then wait for the number of seconds you've been asked to.
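
A minimal sketch of that check (my own illustration, not part of the code below): retry on 429 and honour the standard Retry-After header when it carries a number of seconds, falling back to a fixed wait otherwise. Note that the example headers further down do include Retry-After, though with a value of 0.

import time
import requests

def get_with_backoff(url, default_wait=60, max_tries=5):
    # Fetch url, retrying on HTTP 429; sleep for Retry-After seconds when
    # the header holds a usable number, otherwise for default_wait seconds.
    response = requests.get(url)
    for _ in range(max_tries):
        if response.status_code != 429:
            break
        retry_after = response.headers.get('Retry-After', '')
        time.sleep(int(retry_after) if retry_after.isdigit() else default_wait)
        response = requests.get(url)
    return response

response = get_with_backoff('https://stackoverflow.com/users/1')
print(response.status_code)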



I reproduced the issue here.
I couldn't find any information on how long to wait in either the content or the headers.

I suggest putting in some throttles and adjusting them until you're happy with the results.



See https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site for an example of getting user reputations from the Stack Exchange Data Explorer.



Example follows.



#!/usr/bin/env python

import time
import requests
from bs4 import BeautifulSoup

df = {}
df['target'] = [ ... ]  # see https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site

throttle = 2
whoa = 450

with open('results.txt', 'w') as file_handler:
    file_handler.write('url\treputation\n')
    for id in df['target']:
        time.sleep(throttle)
        url = 'https://stackoverflow.com/users/' + str(id)
        print(url)
        response = requests.get(url)
        while response.status_code == 429:
            print(response.content)
            print(response.headers)
            time.sleep(whoa)
            response = requests.get(url)
        html_soup = BeautifulSoup(response.text, 'html.parser')
        site_title = html_soup.find("title").contents[0]
        if "Page Not Found - Stack Overflow" in site_title:
            reputation = "NA"
        else:
            reputation = html_soup.find(class_='grid--cell fs-title fc-dark').contents[0].replace(',', '')
        print('reputation: %s' % reputation)
        file_handler.write('%s\t%s\n' % (url, reputation))


Example error content.



<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Too Many Requests - Stack Exchange</title>
<style type="text/css">
body
{
color: #333;
font-family: 'Helvetica Neue', Arial, sans-serif;
font-size: 14px;
background: #fff url('img/bg-noise.png') repeat left top;
line-height: 1.4;
}
h1
{
font-size: 170%;
line-height: 34px;
font-weight: normal;
}
a { color: #366fb3; }
a:visited { color: #12457c; }
.wrapper {
width:960px;
margin: 100px auto;
text-align:left;
}
.msg {
float: left;
width: 700px;
padding-top: 18px;
margin-left: 18px;
}
</style>
</head>
<body>
<div class="wrapper">
<div style="float: left;">
<img src="https://cdn.sstatic.net/stackexchange/img/apple-touch-icon.png" alt="Stack Exchange" />
</div>
<div class="msg">
<h1>Too many requests</h1>
<p>This IP address (nnn.nnn.nnn.nnn) has performed an unusual high number of requests and has been temporarily rate limited. If you believe this to be in error, please contact us at <a href="mailto:team@stackexchange.com?Subject=Rate%20limiting%20of%20nnn.nnn.nnn.nnn%20(Request%20ID%3A%202158483152-SYD)">team@stackexchange.com</a>.</p>
<p>When contacting us, please include the following information in the email:</p>
<p>Method: rate limit</p>
<p>XID: 2158483152-SYD</p>
<p>IP: nnn.nnn.nnn.nnn</p>
<p>X-Forwarded-For: nnn.nnn.nnn.nnn</p>
<p>User-Agent: python-requests/2.20.1</p>
<p>Reason: Request rate.</p>
<p>Time: Tue, 20 Nov 2018 21:10:55 GMT</p>
<p>URL: stackoverflow.com/users/nnnnnnn</p>
<p>Browser Location: <span id="jslocation">(not loaded)</span></p>
</div>
</div>
<script>document.getElementById('jslocation').innerHTML = window.location.href;</script>
</body>
</html>


Example error headers.



{
"Content-Length": "2054",
"Via": "1.1 varnish",
"X-Cache": "MISS",
"X-DNS-Prefetch-Control": "off",
"Accept-Ranges": "bytes",
"X-Timer": "S1542748255.394076,VS0,VE0",
"Server": "Varnish",
"Retry-After": "0",
"Connection": "close",
"X-Served-By": "cache-syd18924-SYD",
"X-Cache-Hits": "0",
"Date": "Tue, 20 Nov 2018 21:10:55 GMT",
"Content-Type": "text/html"
}






answered Nov 20 at 20:20, edited Nov 20 at 22:49 – Keith John Hutchison

  • The error headers don't tell how long to wait, which is sad. Thanks for the help, but a wait time of 15 minutes is a lot. I was just wondering: if it just keeps checking the response until the response is not 429, then breaks and does the normal operation, is a wait time still necessary? – enjal, Nov 20 at 21:31
  • You can adjust the throttle amounts as you wish. If the throttle is set correctly you'll never hit 'whoa'. I'm not sure how long the rate limit is. Stack Overflow should send back information saying to wait x number of seconds. – Keith John Hutchison, Nov 20 at 21:35
  • I'm doing a run with throttle = 2, whoa = 450. – Keith John Hutchison, Nov 20 at 21:46
  • Which processed 500 urls with no issues. – Keith John Hutchison, Nov 20 at 22:13
  • It's still running. But I think this will get the job done. Thanks. – enjal, Nov 24 at 4:35

I suggest using the Python time module and throwing a time.sleep(5) into your for loop. The error comes from making too many requests in too short a time period. You may have to play around with the actual sleep time to get it right, though.
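
A minimal sketch of that suggestion (my own illustration; user_ids stands in for the question's df['target'], and five seconds is only a starting point to tune):

import time
from requests import get

# Hypothetical stand-in for the question's df['target'] list of user IDs.
user_ids = [1, 2, 3]

for user_id in user_ids:
    time.sleep(5)  # pause between requests so the rate limit is never tripped
    response = get('https://stackoverflow.com/users/' + str(user_id))
    print(user_id, response.status_code)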






answered Nov 20 at 20:12 – ahota

  • I don't think that putting in a sleep is a good thing to do. – enjal, Nov 20 at 20:58
  • @enjal why do you think that? – ahota, Nov 20 at 21:01
  • Because it is difficult to get the exact sleep time. – enjal, Nov 20 at 21:17










