Index out of range when sending requests in a loop
I get an index out of range error when I try to read the number of contributors of a GitHub project in a loop. After some iterations (which work perfectly) it just throws that exception, and I have no clue why:
import requests
from lxml import html

for x in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)  # prints the correct number until the exception
Here's the exception.
----> 4 contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
IndexError: list index out of range
Tags: python, indexoutofrangeexception
asked Nov 23 '18 at 12:48
Max
You are almost certainly being blocked by GitHub. Use their API, don't scrape the site.
– Martijn Pieters♦
Nov 23 '18 at 12:55
3 Answers
It seems likely that you're getting a 429 Too Many Requests response, since you're firing requests one after the other.
You might want to modify your code like so:
import time
import requests
from lxml import html

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
    contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
    print(contributors_number)
    time.sleep(3)  # wait a bit before firing off another request
Better yet would be:
import time
import requests
from lxml import html

for index in range(100):
    r = requests.get('https://github.com/tipsy/profile-summary-for-github')
    if r.status_code == 200:  # check if the request was successful
        xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
        contributors_number = int(html.fromstring(r.text).xpath(xpath)[0].strip().replace(',', ''))
        print(contributors_number)
    else:
        print("Failed fetching page, status code: " + str(r.status_code))
    time.sleep(3)  # wait a bit before firing off another request
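As an aside: rather than sleeping a fixed three seconds, you could let requests retry and back off on its own. The following is only a sketch of that idea, using the Retry helper from the urllib3 package that ships with requests; the retry count and backoff factor are arbitrary choices, not values from this answer.

import requests
from lxml import html
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures (including 429) with exponential backoff;
# Retry honours a Retry-After header by default.
retries = Retry(total=5, backoff_factor=2,
                status_forcelist=(429, 500, 502, 503, 504))

session = requests.Session()
session.mount('https://', HTTPAdapter(max_retries=retries))

r = session.get('https://github.com/tipsy/profile-summary-for-github')
xpath = '//span[contains(@class, "num") and following-sibling::text()[normalize-space()="contributors"]]/text()'
matches = html.fromstring(r.text).xpath(xpath)
if matches:  # guard against an empty result instead of indexing blindly
    print(int(matches[0].strip().replace(',', '')))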
They should be using the GitHub API rather than scrape the human-readable side.
– Martijn Pieters♦
Nov 23 '18 at 12:55
Agreed, but if they need help with understanding why this simple script fails, I doubt they're up to date on using APIs.
– alexisdevarennes
Nov 23 '18 at 12:56
This is the case indeed. I was able to reproduce the issue and got the following response back after a couple of requests: jsfiddle.net/48v3rt6o ("You have triggered an abuse detection mechanism. Please wait a few minutes before you try again.")
– Matias Cicero
Nov 23 '18 at 12:57
@Max: contributor information is available just fine: developer.github.com/v3/repos/collaborators
– Martijn Pieters♦
Nov 23 '18 at 13:01
@Max: those limitations are there for a reason. Trying to bypass limitations set by GitHub just means they'll limit you from accessing altogether. This is not a game you can win.
– Martijn Pieters♦
Nov 23 '18 at 13:05
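To make the advice in these comments concrete, here is a minimal sketch of getting the contributor count from the REST API instead of scraping (unauthenticated, so the low anonymous rate limit applies). It leans on GitHub's pagination Link headers, which requests exposes as r.links: ask for one contributor per page, and the page number of the "last" link is the total.

import requests
from urllib.parse import parse_qs, urlparse

url = 'https://api.github.com/repos/tipsy/profile-summary-for-github/contributors'
r = requests.get(url, params={'per_page': 1, 'anon': 'true'})
r.raise_for_status()

last = r.links.get('last', {}).get('url')
if last:
    # e.g. ...?per_page=1&anon=true&page=42 means 42 contributors
    count = int(parse_qs(urlparse(last).query)['page'][0])
else:
    count = len(r.json())  # only one page of results
print('Contributor count:', count)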
Now this works perfectly for me while using the API. Probably the cleanest way of doing it.
import requests
import json

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100'
response = requests.get(url)
commits = json.loads(response.text)
commits_total = len(commits)
page_number = 1

while len(commits) == 100:
    page_number += 1
    url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits?&per_page=100' + '&page=' + str(page_number)
    response = requests.get(url)
    commits = json.loads(response.text)
    commits_total += len(commits)
Don't guess at the page numbers. Instead, use response.links['next']['url'] (if set) or go straight to response.links['last']['url'].
– Martijn Pieters♦
Nov 23 '18 at 16:02
Yes, this code is indeed not perfect, but I won't work at it any further.
– Max
Nov 23 '18 at 16:19
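For reference, a sketch of the links-based pagination suggested in the comment above, applied to the same commit-counting task (still unauthenticated, so the anonymous rate limit applies):

import requests

url = 'https://api.github.com/repos/valentinxxx/nginxconfig.io/commits'
commits_total = 0

with requests.Session() as session:
    response = session.get(url, params={'per_page': 100})
    while True:
        response.raise_for_status()
        commits_total += len(response.json())
        next_url = response.links.get('next', {}).get('url')
        if next_url is None:
            break  # GitHub advertises no further pages
        response = session.get(next_url)

print('Total commits:', commits_total)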
GitHub is blocking your repeated requests. Do not scrape sites in quick succession; many website operators actively block clients that send too many requests, and the content returned then no longer matches your XPath query.
You should be using the REST API that GitHub provides to retrieve project stats such as the number of contributors, and you should implement some kind of rate limiting. There is no need to retrieve the same number 100 times; contributor counts do not change that rapidly.
API responses include information on how many requests you can make in a time window, and you can use conditional requests to only incur rate limit costs when the data has actually changed:
import requests
import time
from urllib.parse import parse_qsl, urlparse

owner, repo = 'tipsy', 'profile-summary-for-github'
github_username = '....'
# token = '....'  # optional GitHub basic auth token
stats = 'https://api.github.com/repos/{}/{}/contributors'

with requests.session() as sess:
    # GitHub requests you use your username or appname in the header
    sess.headers['User-Agent'] += ' - {}'.format(github_username)

    # Consider logging in! You'll get more quota
    # sess.auth = (github_username, token)

    # start with the first, move to the last when available, include anonymous
    last_page = stats.format(owner, repo) + '?per_page=100&page=1&anon=true'
    while True:
        r = sess.get(last_page)
        if r.status_code == requests.codes.not_found:
            print("No such repo")
            break
        if r.status_code == requests.codes.no_content:
            print("No contributors, repository is empty")
            break

        if r.status_code == requests.codes.accepted:
            print("Stats not yet ready, retrying")
        elif r.status_code == requests.codes.not_modified:
            print("Stats not changed")
        elif r.ok:
            # success! Check for a last page, get that instead of current
            # to get accurate count
            link_last = r.links.get('last', {}).get('url')
            if link_last and r.url != link_last:
                last_page = link_last
            else:
                # this is the last page, report on count
                params = dict(parse_qsl(urlparse(r.url).query))
                page_num = int(params.get('page', '1'))
                per_page = int(params.get('per_page', '100'))
                contributor_count = len(r.json()) + (per_page * (page_num - 1))
                print("Contributor count:", contributor_count)

        # only get us a fresh response next time
        sess.headers['If-None-Match'] = r.headers['ETag']

        # pace ourselves following the rate limit
        window_remaining = int(r.headers['X-RateLimit-Reset']) - time.time()
        rate_remaining = int(r.headers['X-RateLimit-Remaining'])
        # sleep long enough to honour the rate limit or at least 100 milliseconds
        time.sleep(max(window_remaining / rate_remaining, 0.1))
The above uses a requests session object to handle repeated headers and ensure that you get to reuse connections where possible.
A good library such as github3.py (incidentally written by a requests core contributor) will take care of most of those details for you.
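For comparison, roughly what the contributor count looks like with github3.py. This is only a sketch based on that library's documented GitHub(), repository() and contributors() helpers, not code from this answer; check its documentation for the exact signatures.

import github3

# Anonymous access works but has a much lower rate limit; use
# github3.login(username, token=...) for the authenticated quota.
gh = github3.GitHub()
repository = gh.repository('tipsy', 'profile-summary-for-github')

# contributors() returns an iterator that follows pagination for you
contributor_count = sum(1 for _ in repository.contributors(anon=True))
print('Contributor count:', contributor_count)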
If you do insist on scraping the site directly, you take the risk that the site operators block you altogether. Try to take some responsibility and don't hammer the site continually.
That means that, at the very least, you should honour the Retry-After header that GitHub gives you on a 429 response:
if not r.ok:
    print("Received a response other than 200 OK:", r.status_code, r.reason)
    retry_after = r.headers.get('Retry-After')
    if retry_after is not None:
        print("Response included a Retry-After:", retry_after)
        time.sleep(int(retry_after))
else:
    ...  # parse the OK response here
This post is from '13; I'm able to get more than 100 by using the loop in my answer. Nevertheless, I ran into the problem of sending too many requests when I was trying to fetch the total number of commits of the Linux kernel, which has ~800,000 commits. I got blocked after 70 iterations of my while loop. The error message was: "API rate limit exceeded for [IP] (But here's the good news: Authenticated requests get a higher rate limit. Check out the documentation for more details.)"
– Max
Nov 23 '18 at 14:34
@Max: The code in my answer works for repos I tested. What's the Linux Kernel repository URL?
– Martijn Pieters♦
Nov 23 '18 at 14:53
github.com/torvalds/linux
– Max
Nov 23 '18 at 14:53
@Max: "The history or contributor list is too large to list contributors for this repository via the API." and a 403 Forbidden. Clone that repository and get the count from git directly.
– Martijn Pieters♦
Nov 23 '18 at 15:09
Alright, in my case it's easier to use the following git command, since I download the repositories anyway: git rev-list --all --count. But in the near future, the easiest way is probably to use the new GitHub API v4 (developer.github.com/v4).
– Max
Nov 23 '18 at 15:21
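For reference, the API v4 mentioned in that comment is GitHub's GraphQL endpoint. Below is a sketch of a commit-count query sent with requests; it requires a personal access token (placeholder below), and it counts commits reachable from the default branch, which is not quite the same thing as git rev-list --all --count.

import requests

token = '....'  # personal access token; the GraphQL endpoint requires auth

query = """
{
  repository(owner: "torvalds", name: "linux") {
    defaultBranchRef {
      target {
        ... on Commit {
          history {
            totalCount
          }
        }
      }
    }
  }
}
"""

r = requests.post(
    'https://api.github.com/graphql',
    json={'query': query},
    headers={'Authorization': 'bearer {}'.format(token)},
)
r.raise_for_status()
payload = r.json()
total = payload['data']['repository']['defaultBranchRef']['target']['history']['totalCount']
print('Commits on the default branch:', total)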