Wrapping a web scraper in a RESTful API
The problem I am looking to solve is wrapping a web scraper in a RESTful API such that it can be called programmatically from another application, frontend or microservice. The overall goal is that this piece of code will form one part of a larger application in a microservices architecture.
This program scrapes the Radio France Internationale Journal en français facile (RFI JEFF) website for the French transcriptions of their daily news podcast. The web scraper is built using Beautiful Soup, the API is built using Flask-RESTPlus, and the code is packaged in a Docker container.
The code operates as follows:
- The API is called with the desired program (broadcast) date as the data payload
- The web scraper is then called to scrape the appropriate webpage for that given date
- The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information
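As a rough illustration of the first step, a client might build the request URL like this. It is only a sketch: the base URL assumes the container is running locally on port 8000 (as in the docker command in this post) and that the 'web' namespace is mounted at /web with no extra prefix; both helper names are made up for the example.

```python
from datetime import date

# Assumption: the API is reachable on localhost:8000 and the 'web'
# namespace is mounted at /web without any additional prefix.
API_BASE_URL = 'http://localhost:8000/web'


def format_program_date(broadcast_date):
    """Formats a date as DDMMYYYY, the format the API accepts."""
    return broadcast_date.strftime('%d%m%Y')


def build_request_url(broadcast_date, base_url=API_BASE_URL):
    """Builds the transcriptions endpoint URL for one broadcast date."""
    return '{}/transcriptions/{}'.format(
        base_url, format_program_date(broadcast_date))


if __name__ == '__main__':
    print(build_request_url(date(2018, 7, 23)))
```

The resulting URL can then be passed to any HTTP client, for example requests.get.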
To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.
I will include a few sections of code for the web scraper and API only. The full repository can be found here.
You can clone the repo, build the container, and run it to test the program:
git clone https://github.com/25Postcards/rfi_jeff_api
sudo docker build . -t jeff_api:latest
sudo docker run -p 8000:8000 jeff_api
API
from flask_restplus import Namespace, Resource, fields

from core import jeff_scraper
from core import jeff_logger
from core import jeff_validators

api = Namespace('web', description='Operations on the RFI website.')

# This model definition is required so it can be registered to the API docs
transcriptions_model = api.model('Transcription', {
    'program_date': fields.String(required=True),
    'encoding': fields.String,
    'title': fields.String,
    'article': fields.List(fields.String),
    'url': fields.Url('trans_pd_ep'),
    'status_code': fields.String,
    'error_message': fields.String
})

@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
@api.param('program_date',
           'The program date for the broadcast. Accepted date format DDMMYYYY.')
@api.doc(model=transcriptions_model)
class Transcriptions(Resource):
    """A Transcriptions resource.
    """
    def get(self, program_date):
        """Gets the transcription from the scraper.

        Args:
            program_date (str): A string representing the program date

        Returns:
            validate_date_errors (dict): A dict of errors raised by the
                validator for the transcriptions schema.
            data (dict): A dict of attributes from the jeff transcriptions object
                containing the program date, title, article, etc. (see schema).
        """
        # Create validator, validate input
        ts = jeff_validators.TranscriptionsSchema()
        validate_date_errors = ts.validate({'program_date': program_date})
        if validate_date_errors:
            return validate_date_errors

        # Create scraper, scrape page
        jt = jeff_scraper.JeffTranscription(program_date)

        # Serialise JeffTranscription object to serialised (formatted) dict
        # according to Transcriptions Schema
        data, errors = ts.dump(jt)
        return data
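The validator in jeff_validators is not shown in this post. As a sketch of the kind of check the DDMMYYYY format implies, using only the standard library (the function name and error messages are made up, and the real schema may behave differently):

```python
from datetime import datetime


def validate_program_date(program_date):
    """Returns a dict of errors; an empty dict means the date is valid.

    Hypothetical stand-in for the schema validation, mirroring the
    'dict of errors' shape that the endpoint returns.
    """
    # strptime tolerates missing zero padding, so check the shape first.
    if len(program_date) != 8 or not program_date.isdigit():
        return {'program_date': ['Expected 8 digits in DDMMYYYY format.']}
    try:
        datetime.strptime(program_date, '%d%m%Y')
    except ValueError:
        return {'program_date': ['Not a valid calendar date.']}
    return {}


if __name__ == '__main__':
    for candidate in ('23072018', '31022018', '2018-07-23'):
        print(candidate, validate_program_date(candidate))
```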
Web scraper
import requests
import logging
from bs4 import BeautifulSoup

from core import jeff_errors
from core import jeff_logger

class JeffTranscription(object):
    """Represents a transcription from the rfi jeff website.

    Attributes:
        program_date (str): A string for the program date of the broadcast,
            accepted date format DDMMYYYY.
        title (str): A string for the title of the transcription.
        article (list(str)): A list of strings for each paragraph in the
            transcription article.
        encoding (str): A string defining the encoding of the transcription.
        is_error (bool): A boolean indicating if an error occurred.
        error_message (str): A string for error messages generated whilst requesting
            the webpage or whilst parsing the content.
        status_code (str): A string indicating the http status code for responses
            from the rfi jeff website.
        url (str): The URL for the rfi jeff webpage for the transcription.
    """
    def __init__(self, program_date):
        """Inits JeffTranscription with the program date."""
        self.program_date = program_date
        self._makeURL()
        self.title = None
        self.article = None
        self.encoding = None
        self.is_error = False
        self.error_message = None
        self.status_code = None
        self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')

        try:
            page_response = self._getPageResponse()
            page_content = page_response.content
            self._scrapePage(page_content)
        except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,
                jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:
            self._handleScraperErrors(e)

    def _handleScraperErrors(self, e):
        """Handles errors raised by the methods.

        Sets the is_error and error_message attributes.

        Args:
            e: An error object raised by the class methods
        """
        self.is_error = True
        self.error_message = e.message
        self._jeff_scraper_logger.logger.error(self.error_message)

    def _makeURL(self):
        """Makes the url for the RFI JEFF Website."""
        RFI_JEFF_BASE_URL = ('https://savoirs.rfi.fr/fr/apprendre-enseigner/'
                             'langue-francaise/journal-en-francais-facile-')
        RFI_JEFF_END_URL = '-20h00-gmt'
        self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL

    def _getPageResponse(self):
        """Gets the response from the webpage.

        Returns:
            A requests.Response object.

        Raises:
            ScraperConnectionError: A connection error occurred.
            ScraperTimeoutError: A timeout error occurred.
            ScraperHTTPError: An HTTP Error occurred.
        """
        try:
            HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
            page_response = requests.get(self.url, headers=HEADERS, timeout=5)
            self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
            self.status_code = page_response.status_code
            self.encoding = page_response.encoding
            page_response.raise_for_status()
        except requests.exceptions.ConnectionError as e:
            raise jeff_errors.ScraperConnectionError from e
        except requests.exceptions.Timeout as e:
            raise jeff_errors.ScraperTimeoutError from e
        except requests.exceptions.HTTPError as e:
            raise jeff_errors.ScraperHTTPError from e
        return page_response

    def _scrapePage(self, page_content):
        """Parses the html content from the webpage response.

        Uses the Beautiful Soup library to parse the webpage content to extract
        the title of the broadcast and the transcription. The transcription
        is a series of paragraph elements within an article element that has a
        class attribute defined by ARTICLE_CLASS.

        Example Webpage HTML
        view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt
        Article element at line 518
        First paragraph element at line 532

        Args:
            page_content: The content of the webpage as HTML

        Raises:
            ScraperParserError: An error occurred parsing the page content.
        """
        try:
            ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
            bs = BeautifulSoup(page_content, "html.parser")
            title_tag = bs.find('title')
            title_text = title_tag.get_text()

            # Find all the p elements within the article element that has the
            # ARTICLE_CLASS class attribute. Remove newline characters and
            # unwanted unicode characters from the p element's text fields.
            # Create a list of strings, one list element for each paragraph.
            article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')
            article_p_text = [p_tag.get_text().replace('\n', '').replace(u'\xa0', u'')
                              for p_tag in article_p_tags]
            self._jeff_scraper_logger.logger.info('Page content parsed')
            self.title = title_text
            self.article = article_p_text
        except Exception as e:
            raise jeff_errors.ScraperParserError() from e
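The URL construction and the paragraph clean-up can be checked in isolation, without touching the network. This is a minimal sketch, not the class itself: it assumes the two adjacent string literals in _makeURL are meant to be joined (hence the parentheses) and that the replace calls target the newline character and the non-breaking space; the function names are made up.

```python
# Adjacent string literals inside parentheses are concatenated by Python.
RFI_JEFF_BASE_URL = ('https://savoirs.rfi.fr/fr/apprendre-enseigner/'
                     'langue-francaise/journal-en-francais-facile-')
RFI_JEFF_END_URL = '-20h00-gmt'


def make_url(program_date):
    """Builds the RFI JEFF page URL for a DDMMYYYY program date."""
    return RFI_JEFF_BASE_URL + program_date + RFI_JEFF_END_URL


def clean_paragraph(text):
    """Removes newlines and non-breaking spaces from a paragraph's text."""
    return text.replace('\n', '').replace('\xa0', '')


if __name__ == '__main__':
    print(make_url('23072018'))
    print(clean_paragraph('Bonjour\xa0!\n'))
```

For the date 23072018 this produces the same URL as the example page quoted in the _scrapePage docstring.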
Notes/Questions:
- I've stuck to the Google Python style guide where possible.
- The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?
- The _getPageResponse method contains a try/except to handle HTTP request errors. Is it correct to use a try/except within a function or method to catch exceptions within the function?
- The _getPageResponse method is called from within another try/except block within the __init__ method. Is it good practice to have try/except within try/except that are within the __init__ method? The same concern arises with the _scrapePage method.
- Should I be using Flask-RESTPlus to generate Swagger API specs/docs automatically? Or should I use a separate extension to generate Swagger API specs/docs with Flask-RESTful (if so, what would this Swagger extension be)?
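To make the second and fourth notes concrete, here is a sketch of the alternative being asked about: the constructor only stores state, an explicit scrape() method does the work, and the two try/except layers have separate jobs (the inner one translates a low-level error, the outer one records it). Every name here is hypothetical, and fake_fetch stands in for requests.get so that no network access is needed.

```python
class ScraperError(Exception):
    """Hypothetical base error that stores a message attribute."""

    def __init__(self, message):
        super().__init__(message)
        self.message = message


class ScraperTimeoutError(ScraperError):
    """Hypothetical stand-in for jeff_errors.ScraperTimeoutError."""


def fake_fetch(url):
    """Stand-in for an HTTP request; always times out."""
    raise TimeoutError('no response from ' + url)


class LazyTranscription:
    """Stores its inputs on construction and scrapes only when asked."""

    def __init__(self, program_date, fetch=fake_fetch):
        self.program_date = program_date
        self.content = None
        self.is_error = False
        self.error_message = None
        self.cause = None
        self._fetch = fetch

    def _get_page(self):
        # Inner try/except: translate a low-level error into a domain error.
        try:
            return self._fetch('https://example.invalid/' + self.program_date)
        except TimeoutError as e:
            raise ScraperTimeoutError('Request timed out.') from e

    def scrape(self):
        # Outer try/except: record any domain error on the object.
        try:
            self.content = self._get_page()
        except ScraperError as e:
            self.is_error = True
            self.error_message = e.message
            self.cause = e.__cause__  # original exception kept by 'from e'
        return self


if __name__ == '__main__':
    t = LazyTranscription('23072018')
    print(t.is_error)  # nothing has been requested yet
    t.scrape()
    print(t.is_error, t.error_message, repr(t.cause))
```

Note that built-in Python 3 exceptions have no message attribute, so reading e.message only works when the custom error classes define it, as the hypothetical ScraperError does here.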
python web-scraping api beautifulsoup flask
add a comment |
The problem I am looking to solve is wrapping a web scraper in a RESTful API such that it can be called programmatically from another application, frontend or microservice. The overall goal is that this piece of code will form one part of a larger application in a microservices architecture.
This program scrapes the Radio Francais International Journal en Francais Facile (RFI JEFF) website for the French transcriptions of their daily news podcast. The web scraper is built using Beautiful Soup, the API is built using Flask Restplus, and the code is packaged in a Docker container.
The code operates as follows:
- The API is called with the desired program (broadcast) date as the data payload
- The web scraper is then called to scrape the appropriate webpage for that given date
- The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information
To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.
I will include a few sections of code for the web scraper and API only. The full repository can be found here.
You can clone the repo, build the container, and run it to test the program:
git clone https://github.com/25Postcards/rfi_jeff_api
sudo docker build . -t jeff_api:latest
sudo docker run -p 8000:8000 jeff_api
API
from flask_restplus import Namespace, Resource, fields
from core import jeff_scraper
from core import jeff_logger
from core import jeff_validators
api = Namespace('web', description='Operations on the RFI website.')
# This model definition is required so it can be registered to the API docs
transcriptions_model = api.model('Transcription', {
'program_date': fields.String(required=True),
'encoding': fields.String,
'title': fields.String,
'article': fields.List(fields.String),
'url': fields.Url('trans_pd_ep'),
'status_code': fields.String,
'error_message': fields.String
})
@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
@api.param('program_date',
'The program date for the broadcast. Accepted date format DDMMYYYY.')
@api.doc(model=transcriptions_model)
class Transcriptions(Resource):
"""A Transcriptions resource.
"""
def get(self, program_date):
"""Gets the transcription from the scrapper.
Args:
program_date (str): A string representing the program date
Returns:
validate_date_errors (dict): A dict of errors raised by the
validator for the transcriptions schema.
data (dict): A dict of attributes from the jeff transcriptions object
containing the program date, title, article, etc. (see schema).
"""
# Create validator, validate input
ts = jeff_validators.TranscriptionsSchema()
validate_date_errors = ts.validate({'program_date': program_date})
if validate_date_errors:
return validate_date_errors
# Create scrapper, scrape page
jt = jeff_scraper.JeffTranscription(program_date)
# Serialise JeffTranscription object to serialised (formatted) dict
# according to Transcriptions Schema
data, errors = ts.dump(jt)
return data
Web scraper
import requests
import logging
from bs4 import BeautifulSoup
from core import jeff_errors
from core import jeff_logger
class JeffTranscription(object):
"""Represents a transcription from the rfi jeff website.
Attributes:
program_date (str): A string for the program date of the broadcast,
accepted date format DDMMYYYY.
title (str): A string for the title of the transcription.
article (list(str)): A list of strings for each paragraph in the
transcription article.
encoding (str): A string defining the encoding of the transcription.
is_error (bool): A boolean indicating if an error occurred.
error_message (str): A string for error messages generated whilst requesting
the webpage or whilst parsing the content.
status_code (str): A string indicating the http status code for responses
from the rfi jeff website.
url (str): The URL for the rfi jeff webpage for the transcription.
"""
def __init__(self, program_date):
"""Inits JeffTranscription with the program date."""
self.program_date = program_date
self._makeURL()
self.title = None
self.article = None
self.encoding = None
self.is_error = False
self.error_message = None
self.status_code = None
self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')
try:
page_response = self._getPageResponse()
page_content = page_response.content
self._scrapePage(page_content)
except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,
jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:
self._handleScraperErrors(e)
def _handleScraperErrors(self, e):
"""Handles errors raised by the methods.
Sets the is_error and error_message attributes.
Args:
e: An error object raised by the class methods
"""
self.is_error = True
self.error_message = e.message
self._jeff_scraper_logger.logger.error(self.error_message)
def _makeURL(self):
"""Makes the url for the RFI JEFF Website."""
RFI_JEFF_BASE_URL = 'https://savoirs.rfi.fr/fr/apprendre-enseigner/'
'langue-francaise/journal-en-francais-facile-'
RFI_JEFF_END_URL = '-20h00-gmt'
self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL
def _getPageResponse(self):
"""Gets the response from the webpage.
Returns:
A requests.response object.
Raises:
ScraperConnectionError: A connection error occurred.
ScraperTimeoutError: A timeout error occurred.
ScraperHTTPError: An HTTP Error occurred.
"""
try:
HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
page_response = requests.get(self.url, headers=HEADERS, timeout=5)
self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
self.status_code = page_response.status_code
self.encoding = page_response.encoding
page_response.raise_for_status()
except requests.exceptions.ConnectionError as e:
raise jeff_errors.ScraperConnectionError from e
except requests.exceptions.Timeout as e:
raise jeff_errors.ScraperTimeoutError from e
except requests.exceptions.HTTPError as e:
raise jeff_errors.ScraperHTTPError from e
return page_response
def _scrapePage(self, page_content):
"""Parses the html content from the webpage response.
Uses the Beautiful Soup library to parse the webpage content to extract
the title of the broadcast and the transcription. The transcription
is a series of paragraph elements within an article element that has a
class attribute defined by ARTICLE_CLASS.
Example Webpage HTML
view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt
Article element at line 518
First paragraph element at line 532
Args:
page_content: The content of the webpage as HTML
Raises:
ScraperParserError: An error occurred parsing the page content.
"""
try:
ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
bs = BeautifulSoup(page_content, "html.parser")
title_tag = bs.find('title')
title_text = title_tag.get_text()
# Find all the p elements within the article element that has the
# ARTICLE_CLASS class attribute. Remove newline characters and
# unwanted unicode characters from the p element's text fields.
# Create a list of strings, one list element for each paragraph.
article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')
article_p_text = [p_tag.get_text().replace('n', '').replace(u'xa0', u'')
for p_tag in article_p_tags]
self._jeff_scraper_logger.logger.info('Page content parsed')
self.title = title_text
self.article = article_p_text
except Exception as e:
raise jeff_errors.ScraperParserError() from e
Notes/Questions:
- I've stuck to the Google Python style guide where possible.
- The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?
- The
_getPageResponse
method contains atry
/except
to handle HTTP request errors. Is is correct to use atry
/except
with a function or method to catch exceptions within the function? - The
_getPageResponse
method is called from within another try/except block within theinit
method, is it good practice to havetry
/except
withintry
/except
that are within theinit
methods? The same concern arises with the_scrapePage
method. - Should I be using Flask Restplus to generate swagger API specs/docs automatically? Or should I use a separate extension to generate swagger API specs/docs with Flask Restful (if so, what would this swagger extension be)?
python web-scraping api beautifulsoup flask
add a comment |
The problem I am looking to solve is wrapping a web scraper in a RESTful API such that it can be called programmatically from another application, frontend or microservice. The overall goal is that this piece of code will form one part of a larger application in a microservices architecture.
This program scrapes the Radio Francais International Journal en Francais Facile (RFI JEFF) website for the French transcriptions of their daily news podcast. The web scraper is built using Beautiful Soup, the API is built using Flask Restplus, and the code is packaged in a Docker container.
The code operates as follows:
- The API is called with the desired program (broadcast) date as the data payload
- The web scraper is then called to scrape the appropriate webpage for that given date
- The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information
To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.
I will include a few sections of code for the web scraper and API only. The full repository can be found here.
You can clone the repo, build the container, and run it to test the program:
git clone https://github.com/25Postcards/rfi_jeff_api
sudo docker build . -t jeff_api:latest
sudo docker run -p 8000:8000 jeff_api
API
from flask_restplus import Namespace, Resource, fields
from core import jeff_scraper
from core import jeff_logger
from core import jeff_validators
api = Namespace('web', description='Operations on the RFI website.')
# This model definition is required so it can be registered to the API docs
transcriptions_model = api.model('Transcription', {
'program_date': fields.String(required=True),
'encoding': fields.String,
'title': fields.String,
'article': fields.List(fields.String),
'url': fields.Url('trans_pd_ep'),
'status_code': fields.String,
'error_message': fields.String
})
@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
@api.param('program_date',
'The program date for the broadcast. Accepted date format DDMMYYYY.')
@api.doc(model=transcriptions_model)
class Transcriptions(Resource):
"""A Transcriptions resource.
"""
def get(self, program_date):
"""Gets the transcription from the scrapper.
Args:
program_date (str): A string representing the program date
Returns:
validate_date_errors (dict): A dict of errors raised by the
validator for the transcriptions schema.
data (dict): A dict of attributes from the jeff transcriptions object
containing the program date, title, article, etc. (see schema).
"""
# Create validator, validate input
ts = jeff_validators.TranscriptionsSchema()
validate_date_errors = ts.validate({'program_date': program_date})
if validate_date_errors:
return validate_date_errors
# Create scrapper, scrape page
jt = jeff_scraper.JeffTranscription(program_date)
# Serialise JeffTranscription object to serialised (formatted) dict
# according to Transcriptions Schema
data, errors = ts.dump(jt)
return data
Web scraper
import requests
import logging
from bs4 import BeautifulSoup
from core import jeff_errors
from core import jeff_logger
class JeffTranscription(object):
"""Represents a transcription from the rfi jeff website.
Attributes:
program_date (str): A string for the program date of the broadcast,
accepted date format DDMMYYYY.
title (str): A string for the title of the transcription.
article (list(str)): A list of strings for each paragraph in the
transcription article.
encoding (str): A string defining the encoding of the transcription.
is_error (bool): A boolean indicating if an error occurred.
error_message (str): A string for error messages generated whilst requesting
the webpage or whilst parsing the content.
status_code (str): A string indicating the http status code for responses
from the rfi jeff website.
url (str): The URL for the rfi jeff webpage for the transcription.
"""
def __init__(self, program_date):
"""Inits JeffTranscription with the program date."""
self.program_date = program_date
self._makeURL()
self.title = None
self.article = None
self.encoding = None
self.is_error = False
self.error_message = None
self.status_code = None
self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')
try:
page_response = self._getPageResponse()
page_content = page_response.content
self._scrapePage(page_content)
except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,
jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:
self._handleScraperErrors(e)
def _handleScraperErrors(self, e):
"""Handles errors raised by the methods.
Sets the is_error and error_message attributes.
Args:
e: An error object raised by the class methods
"""
self.is_error = True
self.error_message = e.message
self._jeff_scraper_logger.logger.error(self.error_message)
def _makeURL(self):
"""Makes the url for the RFI JEFF Website."""
RFI_JEFF_BASE_URL = 'https://savoirs.rfi.fr/fr/apprendre-enseigner/'
'langue-francaise/journal-en-francais-facile-'
RFI_JEFF_END_URL = '-20h00-gmt'
self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL
def _getPageResponse(self):
"""Gets the response from the webpage.
Returns:
A requests.response object.
Raises:
ScraperConnectionError: A connection error occurred.
ScraperTimeoutError: A timeout error occurred.
ScraperHTTPError: An HTTP Error occurred.
"""
try:
HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
page_response = requests.get(self.url, headers=HEADERS, timeout=5)
self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
self.status_code = page_response.status_code
self.encoding = page_response.encoding
page_response.raise_for_status()
except requests.exceptions.ConnectionError as e:
raise jeff_errors.ScraperConnectionError from e
except requests.exceptions.Timeout as e:
raise jeff_errors.ScraperTimeoutError from e
except requests.exceptions.HTTPError as e:
raise jeff_errors.ScraperHTTPError from e
return page_response
def _scrapePage(self, page_content):
"""Parses the html content from the webpage response.
Uses the Beautiful Soup library to parse the webpage content to extract
the title of the broadcast and the transcription. The transcription
is a series of paragraph elements within an article element that has a
class attribute defined by ARTICLE_CLASS.
Example Webpage HTML
view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt
Article element at line 518
First paragraph element at line 532
Args:
page_content: The content of the webpage as HTML
Raises:
ScraperParserError: An error occurred parsing the page content.
"""
try:
ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
bs = BeautifulSoup(page_content, "html.parser")
title_tag = bs.find('title')
title_text = title_tag.get_text()
# Find all the p elements within the article element that has the
# ARTICLE_CLASS class attribute. Remove newline characters and
# unwanted unicode characters from the p element's text fields.
# Create a list of strings, one list element for each paragraph.
article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')
article_p_text = [p_tag.get_text().replace('n', '').replace(u'xa0', u'')
for p_tag in article_p_tags]
self._jeff_scraper_logger.logger.info('Page content parsed')
self.title = title_text
self.article = article_p_text
except Exception as e:
raise jeff_errors.ScraperParserError() from e
Notes/Questions:
- I've stuck to the Google Python style guide where possible.
- The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?
- The
_getPageResponse
method contains atry
/except
to handle HTTP request errors. Is is correct to use atry
/except
with a function or method to catch exceptions within the function? - The
_getPageResponse
method is called from within another try/except block within theinit
method, is it good practice to havetry
/except
withintry
/except
that are within theinit
methods? The same concern arises with the_scrapePage
method. - Should I be using Flask Restplus to generate swagger API specs/docs automatically? Or should I use a separate extension to generate swagger API specs/docs with Flask Restful (if so, what would this swagger extension be)?
python web-scraping api beautifulsoup flask
The problem I am looking to solve is wrapping a web scraper in a RESTful API such that it can be called programmatically from another application, frontend or microservice. The overall goal is that this piece of code will form one part of a larger application in a microservices architecture.
This program scrapes the Radio Francais International Journal en Francais Facile (RFI JEFF) website for the French transcriptions of their daily news podcast. The web scraper is built using Beautiful Soup, the API is built using Flask Restplus, and the code is packaged in a Docker container.
The code operates as follows:
- The API is called with the desired program (broadcast) date as the data payload
- The web scraper is then called to scrape the appropriate webpage for that given date
- The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information
To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.
I will include a few sections of code for the web scraper and API only. The full repository can be found here.
You can clone the repo, build the container, and run it to test the program:
git clone https://github.com/25Postcards/rfi_jeff_api
sudo docker build . -t jeff_api:latest
sudo docker run -p 8000:8000 jeff_api
API
from flask_restplus import Namespace, Resource, fields
from core import jeff_scraper
from core import jeff_logger
from core import jeff_validators
api = Namespace('web', description='Operations on the RFI website.')
# This model definition is required so it can be registered to the API docs
transcriptions_model = api.model('Transcription', {
'program_date': fields.String(required=True),
'encoding': fields.String,
'title': fields.String,
'article': fields.List(fields.String),
'url': fields.Url('trans_pd_ep'),
'status_code': fields.String,
'error_message': fields.String
})
@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
@api.param('program_date',
'The program date for the broadcast. Accepted date format DDMMYYYY.')
@api.doc(model=transcriptions_model)
class Transcriptions(Resource):
"""A Transcriptions resource.
"""
def get(self, program_date):
"""Gets the transcription from the scrapper.
Args:
program_date (str): A string representing the program date
Returns:
validate_date_errors (dict): A dict of errors raised by the
validator for the transcriptions schema.
data (dict): A dict of attributes from the jeff transcriptions object
containing the program date, title, article, etc. (see schema).
"""
# Create validator, validate input
ts = jeff_validators.TranscriptionsSchema()
validate_date_errors = ts.validate({'program_date': program_date})
if validate_date_errors:
return validate_date_errors
# Create scrapper, scrape page
jt = jeff_scraper.JeffTranscription(program_date)
# Serialise JeffTranscription object to serialised (formatted) dict
# according to Transcriptions Schema
data, errors = ts.dump(jt)
return data
Web scraper
import requests
import logging
from bs4 import BeautifulSoup
from core import jeff_errors
from core import jeff_logger
class JeffTranscription(object):
"""Represents a transcription from the rfi jeff website.
Attributes:
program_date (str): A string for the program date of the broadcast,
accepted date format DDMMYYYY.
title (str): A string for the title of the transcription.
article (list(str)): A list of strings for each paragraph in the
transcription article.
encoding (str): A string defining the encoding of the transcription.
is_error (bool): A boolean indicating if an error occurred.
error_message (str): A string for error messages generated whilst requesting
the webpage or whilst parsing the content.
status_code (str): A string indicating the http status code for responses
from the rfi jeff website.
url (str): The URL for the rfi jeff webpage for the transcription.
"""
def __init__(self, program_date):
"""Inits JeffTranscription with the program date."""
self.program_date = program_date
self._makeURL()
self.title = None
self.article = None
self.encoding = None
self.is_error = False
self.error_message = None
self.status_code = None
self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')
try:
page_response = self._getPageResponse()
page_content = page_response.content
self._scrapePage(page_content)
except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,
jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:
self._handleScraperErrors(e)
def _handleScraperErrors(self, e):
"""Handles errors raised by the methods.
Sets the is_error and error_message attributes.
Args:
e: An error object raised by the class methods
"""
self.is_error = True
self.error_message = e.message
self._jeff_scraper_logger.logger.error(self.error_message)
def _makeURL(self):
"""Makes the url for the RFI JEFF Website."""
RFI_JEFF_BASE_URL = 'https://savoirs.rfi.fr/fr/apprendre-enseigner/'
'langue-francaise/journal-en-francais-facile-'
RFI_JEFF_END_URL = '-20h00-gmt'
self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL
def _getPageResponse(self):
"""Gets the response from the webpage.
Returns:
A requests.response object.
Raises:
ScraperConnectionError: A connection error occurred.
ScraperTimeoutError: A timeout error occurred.
ScraperHTTPError: An HTTP Error occurred.
"""
try:
HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
page_response = requests.get(self.url, headers=HEADERS, timeout=5)
self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
self.status_code = page_response.status_code
self.encoding = page_response.encoding
page_response.raise_for_status()
except requests.exceptions.ConnectionError as e:
raise jeff_errors.ScraperConnectionError from e
except requests.exceptions.Timeout as e:
raise jeff_errors.ScraperTimeoutError from e
except requests.exceptions.HTTPError as e:
raise jeff_errors.ScraperHTTPError from e
return page_response
    def _scrapePage(self, page_content):
        """Parses the html content from the webpage response.

        Uses the Beautiful Soup library to parse the webpage content to extract
        the title of the broadcast and the transcription. The transcription
        is a series of paragraph elements within an article element that has a
        class attribute defined by ARTICLE_CLASS.

        Example Webpage HTML
            view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt
            Article element at line 518
            First paragraph element at line 532

        Args:
            page_content: The content of the webpage as HTML

        Raises:
            ScraperParserError: An error occurred parsing the page content.
        """
        try:
            ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
            bs = BeautifulSoup(page_content, "html.parser")
            title_tag = bs.find('title')
            title_text = title_tag.get_text()
            # Find all the p elements within the article element that has the
            # ARTICLE_CLASS class attribute. Remove newline characters and
            # non-breaking spaces (\xa0) from the p element's text fields.
            # Create a list of strings, one list element for each paragraph.
            article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')
            article_p_text = [
                p_tag.get_text().replace('\n', '').replace(u'\xa0', u'')
                for p_tag in article_p_tags]
            self._jeff_scraper_logger.logger.info('Page content parsed')
            self.title = title_text
            self.article = article_p_text
        except Exception as e:
            raise jeff_errors.ScraperParserError() from e
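To make the URL construction and the text clean-up easier to check in isolation, here is a minimal standalone sketch using only the standard library. The helper names make_url and clean_paragraph are hypothetical and not part of the repository. It also assumes the program date is formatted as ddmmyyyy, which is inferred from the example URL in the docstring above (23072018) rather than confirmed elsewhere.

```python
import datetime

RFI_JEFF_BASE_URL = ('https://savoirs.rfi.fr/fr/apprendre-enseigner/'
                     'langue-francaise/journal-en-francais-facile-')
RFI_JEFF_END_URL = '-20h00-gmt'


def make_url(program_date):
    """Builds the page URL for a datetime.date (hypothetical helper)."""
    # Assumption: the site expects the date as ddmmyyyy, e.g. 23072018.
    return RFI_JEFF_BASE_URL + program_date.strftime('%d%m%Y') + RFI_JEFF_END_URL


def clean_paragraph(text):
    """Removes newlines and non-breaking spaces, mirroring _scrapePage."""
    return text.replace('\n', '').replace('\xa0', '')


url = make_url(datetime.date(2018, 7, 23))
print(url)
print(clean_paragraph('Bonjour\xa0!\n'))
```

Keeping these two steps as plain functions means they can be tested without sending a request to the website.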
Notes/Questions:
- I've stuck to the Google Python style guide where possible.
- The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?
- The _getPageResponse method contains a try/except to handle HTTP request errors. Is it correct to use a try/except within a function or method to catch exceptions raised within that function?
- The _getPageResponse method is called from within another try/except block within the __init__ method. Is it good practice to have a try/except nested within another try/except inside the __init__ method? The same concern arises with the _scrapePage method.
- Should I be using Flask Restplus to generate swagger API specs/docs automatically? Or should I use a separate extension to generate swagger API specs/docs with Flask Restful (if so, what would this swagger extension be)?
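To make the second and fourth questions concrete, here is a rough sketch of the alternative I have in mind: the constructor only stores state, and a separate scrape() method does the work. Everything in it is hypothetical and simplified. ScraperError and ScraperConnectionError stand in for the classes in jeff_errors, and fetch is an injected callable I made up so the example runs without network access. It is not the code in the repository.

```python
class ScraperError(Exception):
    """Hypothetical base class standing in for the jeff_errors classes."""


class ScraperConnectionError(ScraperError):
    """Hypothetical error raised when the page cannot be reached."""


class Scraper:
    def __init__(self, program_date, fetch):
        # The constructor only stores state; no scraping happens here.
        # fetch is an injected callable (an assumption for this sketch).
        self.program_date = program_date
        self._fetch = fetch
        self.is_error = False
        self.error_message = None
        self.content = None

    def _get_page(self):
        # Inner try/except: translate a low-level error into a domain error,
        # keeping the original exception attached via "from e".
        try:
            return self._fetch(self.program_date)
        except ConnectionError as e:
            raise ScraperConnectionError('connection failed') from e

    def scrape(self):
        # Outer try/except: record any domain error on the instance.
        try:
            self.content = self._get_page()
        except ScraperError as e:
            self.is_error = True
            self.error_message = str(e)
        return self


def failing_fetch(program_date):
    raise ConnectionError('network down')


ok = Scraper('23072018', lambda d: '<html>' + d + '</html>').scrape()
bad = Scraper('23072018', failing_fetch).scrape()
print(ok.content, ok.is_error)
print(bad.is_error, bad.error_message)
```

In this layout the inner block only translates exceptions and the outer block only records them, so the two levels have different jobs. Whether that split is preferable to doing everything in __init__ is exactly what I would like feedback on.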
python web-scraping api beautifulsoup flask
edited 21 mins ago by Jamal♦
asked Oct 15 at 22:03 by 25Postcards