Twitter scraper using tweepy
I wrote a Twitter scraper using tweepy so I can scrape user information and tweets. Since the free API doesn't give me the number of messages (replies) per tweet, I had to rely on BeautifulSoup to get it.
import requests
import tweepy
from bs4 import BeautifulSoup

class TweetAPI():
    def __init__(self, k1, k2, k3, k4):
        self.key = k1
        self.secret_key = k2
        self.token = k3
        self.secret_token = k4
        auth = tweepy.OAuthHandler(self.key, self.secret_key)
        auth.set_access_token(self.token, self.secret_token)
        self.api = tweepy.API(auth, wait_on_rate_limit=True)

    def tweet_getter(self, user_id, n):
        tweets = []
        try:
            for tweet in tweepy.Cursor(self.api.user_timeline, id=user_id).items(n):
                url = "https://twitter.com/{}/status/{}".format(user_id, tweet.id_str)
                page = requests.get(url)
                soup = BeautifulSoup(page.content, 'html.parser')
                message_count = int(soup.find('span', {"class": "ProfileTweet-actionCount"})
                                    .text.strip().split()[0])
                tweets.append([user_id, tweet.created_at, tweet.id_str,
                               tweet.favorite_count, tweet.retweet_count,
                               message_count, tweet.text])
            return tweets
        except Exception:
            print("Unable to get user @{} tweets".format(user_id))
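The int(... .split()[0]) conversion will raise if Twitter renders an abbreviated or comma-separated count. A small helper could normalize the scraped text first; this is a sketch, and parse_count is a hypothetical name of mine, not part of the code above. It assumes counts appear in formats like '1,234', '1.2K', '3M', or '42 replies':

```python
def parse_count(text):
    # Hypothetical helper: normalize a scraped Twitter count string.
    # Assumed formats: '1,234', '1.2K', '3M', '42 replies', or empty.
    text = text.strip()
    if not text:
        return 0
    token = text.split()[0].replace(',', '')
    if token.endswith('K'):
        return int(float(token[:-1]) * 1_000)
    if token.endswith('M'):
        return int(float(token[:-1]) * 1_000_000)
    return int(token)
```

An empty span (e.g. zero replies rendered as no text) then yields 0 instead of a ValueError.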
I'm worried about two things:

1. Does wait_on_rate_limit=True really prevent me from exceeding the API's request limits?
2. Should I add an artificial delay in the BeautifulSoup part to avoid being blocked from fetching page content from the Twitter website?
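On the second worry: wait_on_rate_limit only throttles calls made through the tweepy API object; the plain requests.get calls against twitter.com are not covered by it, so a jittered delay between page fetches is a reasonable precaution. A minimal sketch (the function name and default bounds are my own choices, not anything from tweepy):

```python
import random
import time

def polite_delay(min_s=1.0, max_s=3.0):
    # Sleep for a random, jittered interval between scraping requests
    # so the page fetches don't hit the site at a fixed, bot-like cadence.
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```

In the loop, calling polite_delay() just before each requests.get(url) would space the HTML fetches out independently of the API rate limiting.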
python python-3.x beautifulsoup twitter
Why are you using the API but then sending a request to the user-facing page and parsing the HTML for the tweet? This could break if twitter changes that page. You should already have the info you need coming back from the API (or if not, there should be an API endpoint to request it).
– Bailey Parker
1 min ago
asked 5 hours ago by Frank Pinto (new contributor); edited 5 mins ago by Jamal♦
Frank Pinto is a new contributor. Be nice, and check out our Code of Conduct.