Too many requests error while crawling users' reputation from Stack Overflow

I have a list of user IDs and I'm interested in crawling their reputations.

I wrote a script using BeautifulSoup that crawls users' reputations. The problem is that I get a Too many requests error when my script has run for less than a minute. After that, I can't open Stack Overflow in a browser manually either.

My question is: how do I crawl the reputations without getting the Too many requests error?

My code is given below:



# df is a pandas DataFrame whose 'target' column holds the user IDs to crawl.
from requests import get
from bs4 import BeautifulSoup

for id in df['target']:
    url = 'https://stackoverflow.com/users/' + str(id)
    print(url)
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    site_title = html_soup.find("title").contents[0]
    if "Page Not Found - Stack Overflow" in site_title:
        reputation = "NA"
    else:
        reputation = html_soup.find(class_='grid--cell fs-title fc-dark').contents[0].replace(',', '')
    print(reputation)

python beautifulsoup web-crawler

asked Nov 20 at 20:08 by enjal · edited Nov 26 at 1:38 by Pang

  • Why are you doing this with a web crawler instead of using the Stack Exchange Data Explorer? – Barmar, Nov 20 at 20:16
  • Duplicate of stackoverflow.com/questions/22786068/… – Keith John Hutchison, Nov 20 at 20:16
  • @Barmar with that I won't get this error? – enjal, Nov 20 at 20:40
  • You won't be accessing the webserver at all. – Barmar, Nov 20 at 20:41
  • Can you please add the imports to the source in the question. – Keith John Hutchison, Nov 20 at 20:46
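
As Barmar's comments note, you can get reputations without touching the web server at all. The first answer below links a Stack Exchange Data Explorer query for that; the public Stack Exchange API is another route. A minimal sketch using the API's documented /users endpoint (the IDs here are hypothetical stand-ins for df['target']):

import requests

user_ids = [1, 22656, 9951]  # hypothetical stand-ins for df['target']

# The /users endpoint accepts up to 100 semicolon-separated IDs per request.
ids = ';'.join(str(i) for i in user_ids)
url = 'https://api.stackexchange.com/2.3/users/' + ids
data = requests.get(url, params={'site': 'stackoverflow'}).json()

for item in data['items']:
    print(item['user_id'], item['reputation'])

The API is rate limited as well, but it returns the reputation numbers directly instead of HTML to scrape.
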
2 Answers

You can check whether response.status_code == 429, see if there is a value in the response telling you how long to wait, and then wait that many seconds.

I reproduced the issue here. I couldn't find any information about how long to wait in either the content or the headers.

I suggest putting in some throttles and adjusting them until you're happy with the results.

See https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site for an example of getting user reputations from the Stack Exchange Data Explorer.

An example follows.



#!/usr/bin/env python

import time

import requests
from bs4 import BeautifulSoup

df = {}
df['target'] = [ ... ]  # see https://data.stackexchange.com/stackoverflow/query/952/top-500-answerers-on-the-site

throttle = 2  # seconds to sleep between requests
whoa = 450    # seconds to back off after a 429 response

with open('results.txt', 'w') as file_handler:
    file_handler.write('url\treputation\n')
    for id in df['target']:
        time.sleep(throttle)
        url = 'https://stackoverflow.com/users/' + str(id)
        print(url)
        response = requests.get(url)
        while response.status_code == 429:
            print(response.content)
            print(response.headers)
            time.sleep(whoa)
            response = requests.get(url)
        html_soup = BeautifulSoup(response.text, 'html.parser')
        site_title = html_soup.find("title").contents[0]
        if "Page Not Found - Stack Overflow" in site_title:
            reputation = "NA"
        else:
            reputation = html_soup.find(class_='grid--cell fs-title fc-dark').contents[0].replace(',', '')
        print('reputation: %s' % reputation)
        file_handler.write('%s\t%s\n' % (url, reputation))


Example error content.



<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Too Many Requests - Stack Exchange</title>
<style type="text/css">
body
{
color: #333;
font-family: 'Helvetica Neue', Arial, sans-serif;
font-size: 14px;
background: #fff url('img/bg-noise.png') repeat left top;
line-height: 1.4;
}
h1
{
font-size: 170%;
line-height: 34px;
font-weight: normal;
}
a { color: #366fb3; }
a:visited { color: #12457c; }
.wrapper {
width:960px;
margin: 100px auto;
text-align:left;
}
.msg {
float: left;
width: 700px;
padding-top: 18px;
margin-left: 18px;
}
</style>
</head>
<body>
<div class="wrapper">
<div style="float: left;">
<img src="https://cdn.sstatic.net/stackexchange/img/apple-touch-icon.png" alt="Stack Exchange" />
</div>
<div class="msg">
<h1>Too many requests</h1>
<p>This IP address (nnn.nnn.nnn.nnn) has performed an unusual high number of requests and has been temporarily rate limited. If you believe this to be in error, please contact us at <a href="mailto:team@stackexchange.com?Subject=Rate%20limiting%20of%20nnn.nnn.nnn.nnn%20(Request%20ID%3A%202158483152-SYD)">team@stackexchange.com</a>.</p>
<p>When contacting us, please include the following information in the email:</p>
<p>Method: rate limit</p>
<p>XID: 2158483152-SYD</p>
<p>IP: nnn.nnn.nnn.nnn</p>
<p>X-Forwarded-For: nnn.nnn.nnn.nnn</p>
<p>User-Agent: python-requests/2.20.1</p>
<p>Reason: Request rate.</p>
<p>Time: Tue, 20 Nov 2018 21:10:55 GMT</p>
<p>URL: stackoverflow.com/users/nnnnnnn</p>
<p>Browser Location: <span id="jslocation">(not loaded)</span></p>
</div>
</div>
<script>document.getElementById('jslocation').innerHTML = window.location.href;</script>
</body>
</html>


Example error headers.



{
"Content-Length": "2054",
"Via": "1.1 varnish",
"X-Cache": "MISS",
"X-DNS-Prefetch-Control": "off",
"Accept-Ranges": "bytes",
"X-Timer": "S1542748255.394076,VS0,VE0",
"Server": "Varnish",
"Retry-After": "0",
"Connection": "close",
"X-Served-By": "cache-syd18924-SYD",
"X-Cache-Hits": "0",
"Date": "Tue, 20 Nov 2018 21:10:55 GMT",
"Content-Type": "text/html"
}
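
The headers above do include a Retry-After field, but its value in this capture is "0", so it doesn't say how long the block actually lasts. If you'd rather honor it whenever it carries a useful value than hard-code a back-off, a minimal sketch (the helper name and the 60-second fallback are my own choices, not part of the script above):

import time

import requests

def get_with_backoff(url, max_tries=5, fallback_wait=60):
    """Fetch url, sleeping out 429 responses before retrying."""
    response = requests.get(url)
    for _ in range(max_tries):
        if response.status_code != 429:
            break
        # Assumes Retry-After is in seconds, as in the capture above;
        # fall back to a default when it is missing or zero.
        wait = int(response.headers.get('Retry-After', 0)) or fallback_wait
        time.sleep(wait)
        response = requests.get(url)
    return response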

answered Nov 20 at 20:20 by Keith John Hutchison · edited Nov 20 at 22:49

  • the error headers don't tell how long to wait for which is sad. Thanks for the help but a wait time for 15 minutes is a lot. I was just wondering - if it just keeps on checking the response until response is not 429 and then breaks and does the normal operation. Is wait time still necessary then? – enjal, Nov 20 at 21:31
  • You can adjust the throttle amounts as you wish. If the throttle is set correctly you'll never hit 'whoa'. I'm not sure how long the rate limit is. Stack overflow should send back information saying wait x number of seconds. – Keith John Hutchison, Nov 20 at 21:35
  • I'm doing a run with throttle = 2, whoa = 450. – Keith John Hutchison, Nov 20 at 21:46
  • Which processed 500 urls with no issues. – Keith John Hutchison, Nov 20 at 22:13
  • It's still running. But I think this will get the job done. Thanks. – enjal, Nov 24 at 4:35

I suggest using the Python time module and putting a time.sleep(5) in your for loop. The error comes from making too many requests in too short a time period. You may have to play around with the actual sleep time to get it right, though.
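
A minimal sketch of that suggestion applied to the question's loop (the IDs are hypothetical stand-ins for df['target']):

import time

from requests import get

user_ids = [1, 22656, 9951]  # hypothetical stand-ins for df['target']

for user_id in user_ids:
    time.sleep(5)  # pause between requests to stay under the rate limit
    response = get('https://stackoverflow.com/users/' + str(user_id))
    print(user_id, response.status_code)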

answered Nov 20 at 20:12 by ahota

  • I don't think that putting sleep is a good thing to do. – enjal, Nov 20 at 20:58
  • @enjal why do you think that? – ahota, Nov 20 at 21:01
  • Because it's difficult to get the exact sleep time. – enjal, Nov 20 at 21:17