Index out of range when sending requests in a loop
I get an index out of range error when I try to read the number of contributors of a GitHub project in a loop. After some iterations (which work perfectly) it just throws that exception, and I have no clue why:
import requests
from lxml import html

for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)  # prints the correct number until the exception
Here's the exception.
----> 4 contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range
Tags: python, indexoutofrangeexception
asked Nov 23 '18 at 12:48
Max
You are almost certainly being blocked by GitHub. Use their API, don't scrape the site.
– Martijn Pieters♦
Nov 23 '18 at 12:55
3 Answers
It seems likely that you're getting a 429 Too Many Requests response, since you're firing requests one after the other.
You might want to modify your code like so:
import time
import requests
from lxml import html

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3)  # wait a bit before firing off another request
Better yet would be:
import time
import requests
from lxml import html

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if r.status_code == 200:  # check if the request was successful
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number)
    else:
        print("Failed fetching page, status code: " + str(r.status_code))
    time.sleep(3)  # wait a bit before firing off another request
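As an aside: rather than sleeping a fixed three seconds, you could let requests retry and back off on its own. The following is only a sketch of that idea, using the Retry helper from the urllib3 package that ships with requests; the retry count and backoff factor are arbitrary choices, not values from this answer.

import requests
from lxml import html
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures (including 429) with exponential backoff;
# Retry honours a Retry-After header by default.
retries = Retry(total=5, backoff_factor=2,
                status_forcelist=(429, 500, 502, 503, 504))

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retries))

r = session.get('https://github.com/tipsy/profile-summary-for-github')
xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
matches = html.fromstring(r.text).xpath(xpath)
if matches:  # guard against an empty result instead of indexing blindly
    print(int(matches[0].strip().replace(',', '')))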
They should be using the GitHub API rather than scrape the human-readable side.
– Martijn Pieters♦
Nov 23 '18 at 12:55
Agreed, but if they need help with understanding why this simple script fails, I doubt they're up to date on using APIs.
– alexisdevarennes
Nov 23 '18 at 12:56
This is the case indeed. I was able to reproduce the issue and got the following response back after a couple of requests: jsfiddle.net/48v3rt6o ("You have triggered an abuse detection mechanism. Please wait a few minutes before you try again.")
– Matias Cicero
Nov 23 '18 at 12:57
@Max: contributor information is available just fine: developer.github.com/v3/repos/collaborators
– Martijn Pieters♦
Nov 23 '18 at 13:01
@Max: those limitations are there for a reason. Trying to bypass limitations set by GitHub just means they'll limit you from accessing altogether. This is not a game you can win.
– Martijn Pieters♦
Nov 23 '18 at 13:05
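To make the advice in these comments concrete, here is a minimal sketch of getting the contributor count from the REST API instead of scraping (unauthenticated, so the low anonymous rate limit applies). It leans on GitHub's pagination Link headers, which requests exposes as r.links: ask for one contributor per page, and the page number of the "last" link is the total.

import requests
from urllib.parse import parse_qs, urlparse

url = 'https://api.github.com/repos/tipsy/profile-summary-for-github/contributors'
r = requests.get(url, params={'per_page': 1, 'anon': 'true'})
r.raise_for_status()

last = r.links.get('last', {}).get('url')
if last:
    # e.g. ...?per_page=1&anon=true&page=42 means 42 contributors
    count = int(parse_qs(urlparse(last).query)['page'][0])
else:
    count = len(r.json())  # only one page of results
print('Contributor count:', count)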
Now this works perfectly for me while using the API. Probably the cleanest way of doing it.
import requests
import json

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'
response = requests.get(url)
commits = json.loads(response.text)
commits_total = len(commits)
page_number = 1

while len(commits) == 100:
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100' + '&page=' + str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)
Don't guess at the page numbers. Instead, use response.links['next']['url'] (if set) or go straight to response.links['last']['url'].
– Martijn Pieters♦
Nov 23 '18 at 16:02
Yes, this code is indeed not perfect, but I won't work at it any further.
– Max
Nov 23 '18 at 16:19
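For reference, a sketch of the links-based pagination suggested in the comment above, applied to the same commit-counting task (still unauthenticated, so the anonymous rate limit applies):

import requests

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits'
commits_total = 0

with requests.Session() as session:
    response = session.get(url, params={'per_page': 100})
    while True:
        response.raise_for_status()
        commits_total += len(response.json())
        next_url = response.links.get('next', {}).get('url')
        if next_url is None:
            break  # GitHub advertises no further pages
        response = session.get(next_url)

print('Total commits:', commits_total)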
GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block clients that send too many requests, and the content returned then no longer matches your XPath query.
You should be using the REST API that GitHub provides to retrieve project stats such as the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times; contributor counts do not change that rapidly.
API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data has actually changed:
import requests
import time
from urllib.parse import parse_qsl, urlparse

owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....'  # optional GitHub basic auth token
stats = 'https://api.github.com/repos/{}/{}/contributors'

with requests.session() as sess:
    # GitHub requests you use your username or appname in the header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)

    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)

    # start with the first, move to the last when available, include anonymous
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'
    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break

        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page, get that instead of current
            # to get accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report on count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)

        # only get us a fresh response next time
        sess.headers['If-None-Match'] = r.headers['ETag']

        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        rate_remaining = int(r.headers['X-RateLimit-Remaining'])
        # sleep long enough to honour the rate limit or at least 100 milliseconds
        time.sleep(max(window_remaining / rate_remaining, 0.1))
The above uses a requests session object to handle repeated headers and ensure that you get to reuse connections where possible.
A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
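For comparison, roughly what the contributor count looks like with github3.py. This is only a sketch based on that library's documented GitHub(), repository() and contributors() helpers, not code from this answer; check its documentation for the exact signatures.

import github3

# Anonymous access works but has a much lower rate limit; use
# github3.login(username, token=...) for the authenticated quota.
gh = github3.GitHub()
repository = gh.repository('tipsy', 'profile-summary-for-github')

# contributors() returns an iterator that follows pagination for you
contributor_count = sum(1 for _ in repository.contributors(anon=True))
print('Contributor count:', contributor_count)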
If you do insist on scraping the site directly, you take the risk that the site operators block you altogether. Try to take some responsibility and don't hammer the site continually.
That means that, at the very least, you should honour the Retry-After header that GitHub gives you on a 429 response:
if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    ...  # parse the OK response here
This post is from '13; I'm able to get more than 100 by using the loop in my answer. Nevertheless, I ran into the problem of sending too many requests when I was trying to fetch the total number of commits of the Linux kernel, which has ~800,000 commits. I got blocked after 70 iterations of my while loop. The error message was: "API rate limit exceeded for [IP] (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"
– Max
Nov 23 '18 at 14:34
@Max: The code in my answer works for repos I tested. What's the Linux Kernel repository URL?
– Martijn Pieters♦
Nov 23 '18 at 14:53
github.com/torvalds/linux
– Max
Nov 23 '18 at 14:53
@Max: "The history or contributor list is too large to list contributors for this repository via the API." and a 403 Forbidden. Clone that repository and get the count from git directly.
– Martijn Pieters♦
Nov 23 '18 at 15:09
Alright, in my case it's easier to use the following git command, since I download the repositories anyway: git rev-list --all --count. But in the near future, the easiest way is probably to use the new GitHub API v4 (developer.github.com/v4).
– Max
Nov 23 '18 at 15:21
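For reference, the API v4 mentioned in that comment is GitHub's GraphQL endpoint. Below is a sketch of a commit-count query sent with requests; it requires a personal access token (placeholder below), and it counts commits reachable from the default branch, which is not quite the same thing as git rev-list --all --count.

import requests

token = '....'  # personal access token; the GraphQL endpoint requires auth

query = """
{
  repository(owner: "torvalds", name: "linux") {
    defaultBranchRef {
      target {
        ... on Commit {
          history {
            totalCount
          }
        }
      }
    }
  }
}
"""

r = requests.post(
    'https://api.github.com/graphql',
    json={'query': query},
    headers={'Authorization': 'bearer {}'.format(token)},
)
r.raise_for_status()
payload = r.json()
total = payload['data']['repository']['defaultBranchRef']['target']['history']['totalCount']
print('Commits on the default branch:', total)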