Index out of range when sending requests in a loop
I encounter an "index out of range" error when I try to get the number of contributors of a GitHub project in a loop. After some iterations (which work perfectly) it just throws that exception. I have no clue why...

for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)  # prints the correct number until the exception

Here's the exception.

----> 4     contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range
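
The IndexError itself just means the XPath lookup came back with an empty list on that iteration, so [0] has nothing to index. A quick way to see what the server is actually sending at that point (a debugging sketch using the same requests/lxml setup as above) is to guard the lookup and print the status code:

import requests
from lxml import html

xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'

for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    matches = html.fromstring(r.text).xpath(xpath)
    if not matches:
        # the page no longer contains the expected markup; show why
        print('no match on iteration', x, '- HTTP status:', r.status_code)
        print(r.text[:200])  # the start of the body usually names the reason
        break
    print(int(matches[0].strip().replace(',', '')))

When a site starts throttling or blocking the client, the status code and the start of the returned body usually make the reason obvious.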

python indexoutofrangeexception

asked Nov 23 '18 at 12:48 by Max

  • You are almost certainly being blocked by GitHub. Use their API, don't scrape the site.

    – Martijn Pieters
    Nov 23 '18 at 12:55

3 Answers
It seems likely that you're getting a 429 (Too Many Requests) response, since you're firing requests one after the other.

You might want to modify your code as such:

import time

import requests
from lxml import html

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3)  # wait a bit before firing off another request

Better yet would be:

import time

import requests
from lxml import html

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if r.status_code == 200:  # check whether the request was successful
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number)
    else:
        print("Failed fetching page, status code: " + str(r.status_code))
    time.sleep(3)  # wait a bit before firing off another request
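
If you'd rather not hand-roll the delay, another option (a sketch, assuming GitHub signals the throttling with a 429 status) is to let urllib3's Retry policy back off and honour any Retry-After header for you:

import requests
from lxml import html
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry up to 5 times on 429, waiting for Retry-After or backing off exponentially
session.mount('https://', HTTPAdapter(
    max_retries=Retry(total=5, backoff_factor=1, status_forcelist=[429])))

xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
for index in range(100):
    r = session.get('https://github.com/tipsy/profile-summary-for-github')
    matches = html.fromstring(r.text).xpath(xpath)
    if matches:
        print(int(matches[0].strip().replace(',', '')))
    else:
        print("No match, status code:", r.status_code)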

answered Nov 23 '18 at 12:54 by alexisdevarennes (edited Nov 23 '18 at 12:58)

  • They should be using the GitHub API rather than scrape the human-readable side.

    – Martijn Pieters
    Nov 23 '18 at 12:55

  • Agreed, but if they need help with understanding why this simple script fails, I doubt they're up to date on using APIs.

    – alexisdevarennes
    Nov 23 '18 at 12:56

  • This is the case indeed. I was able to reproduce the issue and got the following response back after a couple of requests: jsfiddle.net/48v3rt6o (You have triggered an abuse detection mechanism. Please wait a few minutes before you try again.)

    – Matias Cicero
    Nov 23 '18 at 12:57

  • @Max: contributor information is available just fine: developer.github.com/v3/repos/collaborators

    – Martijn Pieters
    Nov 23 '18 at 13:01

  • @Max: those limitations are there for a reason. Trying to bypass limitations set by GitHub just means they'll limit you from accessing altogether. This is not a game you can win.

    – Martijn Pieters
    Nov 23 '18 at 13:05

Now this works perfectly for me while using the API. Probably the cleanest way of doing it.

import requests
import json

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'
response = requests.get(url)
commits = json.loads(response.text)

commits_total = len(commits)
page_number = 1
# keep requesting pages of 100 commits until a short page signals the end
while len(commits) == 100:
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100' + '&page=' + str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)
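
requests also parses the Link header the GitHub API sends back, so instead of assuming that every full page holds exactly 100 commits you can follow the API's own pagination links via response.links, as suggested in the comments below. A sketch of the same count done that way:

import requests

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?per_page=100'
commits_total = 0

while url:
    response = requests.get(url)
    response.raise_for_status()
    commits_total += len(response.json())
    # follow the API's own 'next' link until there isn't one
    url = response.links.get('next', {}).get('url')

print(commits_total)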

answered Nov 23 '18 at 13:46 by Max

  • Don't guess at the page numbers. Instead, use response.links['next']['url'] (if set) or go straight to response.links['last']['url'].

    – Martijn Pieters
    Nov 23 '18 at 16:02

  • Yes, this code is indeed not perfect, but I won't work at it any further.

    – Max
    Nov 23 '18 at 16:19

GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block clients that send too many requests. As a result, the content that is returned no longer matches your XPath query.

You should be using the REST API that GitHub provides to retrieve project stats like the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times; contributor counts do not change that rapidly.

API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data has actually changed:

import requests
import time
from urllib.parse import parse_qsl, urlparse

owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....'  # optional GitHub basic auth token
stats = 'https://api.github.com/repos/{}/{}/contributors'

with requests.session() as sess:
    # GitHub requests that you use your username or app name in the User-Agent header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)
    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)

    # start with the first page, move to the last when available, include anonymous contributors
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'

    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break
        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page and fetch that instead of the current one
            # to get an accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report on the count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)
                # only get us a fresh response next time
                sess.headers['If-None-Match'] = r.headers['ETag']

        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        rate_remaining = int(r.headers['X-RateLimit-Remaining'])
        # sleep long enough to honour the rate limit, or at least 100 milliseconds
        time.sleep(max(window_remaining / rate_remaining, 0.1))

The above uses a requests session object to handle repeated headers and ensure that you get to reuse connections where possible.

A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
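
For instance, a rough sketch of the same contributor count with github3.py (the repository() and contributors() helpers used here are assumptions based on its documented API; double-check the exact names against the library's docs):

import github3

# public repositories can be read anonymously; logging in raises your rate limit
repo = github3.repository('tipsy', 'profile-summary-for-github')

# contributors() returns an iterator that follows the API's pagination for you
contributor_count = sum(1 for _ in repo.contributors(anon=True))
print("Contributor count:", contributor_count)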



If you do want to persist in scraping the site directly, you do take a risk that the site operators block you altogether. Try to take some responsibility by not hammering the site continually.

That means that, at the very least, you should honour the Retry-After header that GitHub gives you on a 429 response:

if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    ...  # parse the OK response

answered Nov 23 '18 at 13:00 by Martijn Pieters (edited Nov 23 '18 at 14:56)

  • This post is from '13, I'm able to get more than 100 by using the loop in my answer. Nevertheless I ran into the problem that I sent too many requests when I was trying to fetch the total amount of commits of the Linux kernel, which has ~800,000 commits. I got blocked after 70 iterations in my while loop. The error message was: "API rate limit exceeded for [IP] (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"

    – Max
    Nov 23 '18 at 14:34

  • @Max: The code in my answer works for repos I tested. What's the Linux Kernel repository URL?

    – Martijn Pieters
    Nov 23 '18 at 14:53

  • github.com/torvalds/linux

    – Max
    Nov 23 '18 at 14:53

  • @Max: "The history or contributor list is too large to list contributors for this repository via the API." and a 403 Forbidden. Clone that repository and get the count from git directly.

    – Martijn Pieters
    Nov 23 '18 at 15:09

  • Alright, in my case it's easier to use the following git command, since I download the repositories anyway: git rev-list --all --count. But in the near future, the easiest way is probably to use the new GitHub API V4 (developer.github.com/v4).

    – Max
    Nov 23 '18 at 15:21
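
For reference, a small sketch of that git-based approach from Python, assuming the repository has already been cloned to a local directory (the path below is hypothetical):

import subprocess

repo_dir = 'linux'  # hypothetical path to an existing local clone

# total number of commits reachable from any ref, as in the comment above
commit_count = int(subprocess.check_output(
    ['git', 'rev-list', '--all', '--count'],
    cwd=repo_dir, text=True).strip())

# distinct author names, a rough stand-in for a contributor count
authors = subprocess.check_output(
    ['git', 'shortlog', '-sn', '--all'],
    cwd=repo_dir, text=True)
contributor_count = len(authors.splitlines())

print(commit_count, contributor_count)

Note that git shortlog -sn --all counts distinct author names rather than GitHub accounts, so treat the second number as an approximation.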