Wrapping a web scraper in a RESTful API

The problem I am looking to solve is wrapping a web scraper in a RESTful API such that it can be called programmatically from another application, frontend or microservice. The overall goal is that this piece of code will form one part of a larger application in a microservices architecture.

This program scrapes the Radio France Internationale Journal en français facile (RFI JEFF) website for the French transcriptions of its daily news podcast. The web scraper is built using Beautiful Soup, the API is built using Flask-RESTPlus, and the code is packaged in a Docker container.

The code operates as follows:

1. The API is called with the desired program (broadcast) date, passed as a path parameter in DDMMYYYY format.

2. The web scraper is then called to scrape the webpage for that date.

3. The API responds with the transcription of the broadcast as a list of paragraphs, along with other information such as the title, URL and status code.

To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.

I will include a few sections of code for the web scraper and API only. The full repository can be found at https://github.com/25Postcards/rfi_jeff_api.

You can clone the repo, build the container, and run it to test the program:

git clone https://github.com/25Postcards/rfi_jeff_api
cd rfi_jeff_api
sudo docker build . -t jeff_api:latest
sudo docker run -p 8000:8000 jeff_api:latest
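
Once the container is running, another service could call the endpoint along these lines. This is a minimal sketch using only the standard library. The base URL `http://localhost:8000` and the `/web` path prefix (guessed from the namespace name) are assumptions, and the helper names are hypothetical; the repository's actual routing may differ.

```python
import json
from urllib.request import urlopen

# Assumed values: adjust to match how the API is actually mounted.
BASE_URL = 'http://localhost:8000'
NAMESPACE_PREFIX = '/web'


def build_transcription_url(program_date, base_url=BASE_URL):
    """Builds the endpoint URL for a program date in DDMMYYYY format."""
    return (base_url.rstrip('/') + NAMESPACE_PREFIX
            + '/transcriptions/' + program_date)


def fetch_transcription(program_date, timeout=10):
    """Calls the API and returns the decoded JSON response as a dict."""
    url = build_transcription_url(program_date)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


# Usage (requires the container to be running):
# transcription = fetch_transcription('23072018')
# print(transcription['title'])
```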

API

from flask_restplus import Namespace, Resource, fields

from core import jeff_scraper
from core import jeff_logger
from core import jeff_validators

api = Namespace('web', description='Operations on the RFI website.')

# This model definition is required so it can be registered to the API docs
transcriptions_model = api.model('Transcription', {
    'program_date': fields.String(required=True),
    'encoding': fields.String,
    'title': fields.String,
    'article': fields.List(fields.String),
    'url': fields.Url('trans_pd_ep'),
    'status_code': fields.String,
    'error_message': fields.String
})

@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')
@api.param('program_date',
           'The program date for the broadcast. Accepted date format DDMMYYYY.')
@api.doc(model=transcriptions_model)
class Transcriptions(Resource):
    """A Transcriptions resource.
    """
    def get(self, program_date):
        """Gets the transcription from the scraper.

        Args:
            program_date (str): A string representing the program date

        Returns:
            validate_date_errors (dict): A dict of errors raised by the
                validator for the transcriptions schema.
            data (dict): A dict of attributes from the jeff transcriptions object
                containing the program date, title, article, etc. (see schema).
        """
        # Create validator, validate input
        ts = jeff_validators.TranscriptionsSchema()
        validate_date_errors = ts.validate({'program_date': program_date})
        if validate_date_errors:
            return validate_date_errors

        # Create scraper, scrape page
        jt = jeff_scraper.JeffTranscription(program_date)

        # Serialise JeffTranscription object to serialised (formatted) dict
        # according to Transcriptions Schema
        data, errors = ts.dump(jt)
        return data
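
The `jeff_validators` module is not shown here. As a rough illustration of the DDMMYYYY check the endpoint relies on, the standalone sketch below validates a date string with the standard library. The function name and error messages are hypothetical and are not taken from the repository's `TranscriptionsSchema`.

```python
from datetime import datetime


def validate_program_date(program_date):
    """Returns a dict of errors; an empty dict means the date is valid."""
    errors = {}
    # Reject anything that is not exactly eight digits before parsing.
    if len(program_date) != 8 or not program_date.isdigit():
        errors['program_date'] = ['Expected 8 digits in DDMMYYYY format.']
        return errors
    try:
        # strptime raises ValueError for impossible dates such as 31 Feb.
        datetime.strptime(program_date, '%d%m%Y')
    except ValueError:
        errors['program_date'] = ['Not a valid calendar date.']
    return errors
```

Like the schema's `validate` call above, it returns an empty dict for valid input, so the result can be used directly in an `if` test.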

Web scraper

import requests
import logging

from bs4 import BeautifulSoup

from core import jeff_errors
from core import jeff_logger

class JeffTranscription(object):
    """Represents a transcription from the rfi jeff website.

    Attributes:
        program_date (str): A string for the program date of the broadcast,
            accepted date format DDMMYYYY.
        title (str): A string for the title of the transcription.
        article (list(str)): A list of strings for each paragraph in the
            transcription article.
        encoding (str): A string defining the encoding of the transcription.
        is_error (bool): A boolean indicating if an error occurred.
        error_message (str): A string for error messages generated whilst requesting
            the webpage or whilst parsing the content.
        status_code (str): A string indicating the http status code for responses
            from the rfi jeff website.
        url (str): The URL for the rfi jeff webpage for the transcription.
    """

    def __init__(self, program_date):
        """Inits JeffTranscription with the program date."""
        self.program_date = program_date
        self._makeURL()

        self.title = None
        self.article = None
        self.encoding = None

        self.is_error = False
        self.error_message = None
        self.status_code = None

        self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')

        try:
            page_response = self._getPageResponse()
            page_content = page_response.content
            self._scrapePage(page_content)

        except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,
                jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:
            self._handleScraperErrors(e)

    def _handleScraperErrors(self, e):
        """Handles errors raised by the methods.

        Sets the is_error and error_message attributes.

        Args:
            e: An error object raised by the class methods
        """

        self.is_error = True
        self.error_message = e.message
        self._jeff_scraper_logger.logger.error(self.error_message)

    def _makeURL(self):
        """Makes the url for the RFI JEFF Website."""
        RFI_JEFF_BASE_URL = 'https://savoirs.rfi.fr/fr/apprendre-enseigner/' \
                            'langue-francaise/journal-en-francais-facile-'
        RFI_JEFF_END_URL = '-20h00-gmt'
        self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL

    def _getPageResponse(self):
        """Gets the response from the webpage.

        Returns:
            A requests.Response object.

        Raises:
            ScraperConnectionError: A connection error occurred.
            ScraperTimeoutError: A timeout error occurred.
            ScraperHTTPError: An HTTP Error occurred.
        """
        try:
            HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}
            page_response = requests.get(self.url, headers=HEADERS, timeout=5)
            self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')
            self.status_code = page_response.status_code
            self.encoding = page_response.encoding
            page_response.raise_for_status()

        except requests.exceptions.ConnectionError as e:
            raise jeff_errors.ScraperConnectionError from e

        except requests.exceptions.Timeout as e:
            raise jeff_errors.ScraperTimeoutError from e

        except requests.exceptions.HTTPError as e:
            raise jeff_errors.ScraperHTTPError from e

        return page_response

    def _scrapePage(self, page_content):
        """Parses the html content from the webpage response.

        Uses the Beautiful Soup library to parse the webpage content to extract
        the title of the broadcast and the transcription. The transcription
        is a series of paragraph elements within an article element that has a
        class attribute defined by ARTICLE_CLASS.

        Example Webpage HTML
            view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt
            Article element at line 518
            First paragraph element at line 532

        Args:
            page_content: The content of the webpage as HTML

        Raises:
            ScraperParserError: An error occurred parsing the page content.
        """

        try:
            ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"
            bs = BeautifulSoup(page_content, "html.parser")

            title_tag = bs.find('title')
            title_text = title_tag.get_text()

            # Find all the p elements within the article element that has the
            # ARTICLE_CLASS class attribute. Remove newline characters and
            # unwanted unicode characters from the p element's text fields.
            # Create a list of strings, one list element for each paragraph.
            article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')
            article_p_text = [p_tag.get_text().replace('\n', '').replace(u'\xa0', u'')
                              for p_tag in article_p_tags]

            self._jeff_scraper_logger.logger.info('Page content parsed')

            self.title = title_text
            self.article = article_p_text

        except Exception as e:
            raise jeff_errors.ScraperParserError() from e
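
To make the text-cleaning step in `_scrapePage` concrete, the sketch below isolates it as a plain function. `clean_paragraph` mirrors the two `replace` calls above. `normalise_paragraph` is a hypothetical variant, not part of the project, that keeps word boundaries by turning non-breaking spaces into regular spaces; French text commonly places a non-breaking space before punctuation such as `!`, `?` and `:`.

```python
def clean_paragraph(text):
    """Removes newlines and non-breaking spaces, as _scrapePage does."""
    return text.replace('\n', '').replace('\xa0', '')


def normalise_paragraph(text):
    """Hypothetical variant: collapses all whitespace to single spaces."""
    # Convert non-breaking spaces first, then let split() drop runs of
    # spaces, tabs and newlines before re-joining with one space.
    return ' '.join(text.replace('\xa0', ' ').split())


sample = 'Bonjour\xa0!\n'
print(repr(clean_paragraph(sample)))      # 'Bonjour!'
print(repr(normalise_paragraph(sample)))  # 'Bonjour !'
```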

Notes/Questions:

I've stuck to the Google Python style guide where possible.

The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?

The _getPageResponse method contains a try/except to handle HTTP request errors. Is it correct to use a try/except within a function or method to catch exceptions raised inside it?

The _getPageResponse method is called from within another try/except block in the __init__ method. Is it good practice to nest try/except blocks like this, particularly inside __init__? The same concern arises with the _scrapePage method.

Should I be using Flask-RESTPlus to generate Swagger API specs/docs automatically? Or should I use a separate extension to generate them with Flask-RESTful (and if so, which one)?
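
For reference, the two-layer try/except arrangement asked about above reduces to the standalone sketch below. The `jeff_errors` module is not shown in this post, so the exception classes, their `message` attribute, and the function names here are hypothetical stand-ins. The sketch only illustrates the pattern in question and is not a recommendation.

```python
class ScraperError(Exception):
    """Hypothetical base class for scraper errors."""
    message = 'A scraper error occurred.'


class ScraperTimeoutError(ScraperError):
    """Hypothetical domain error for a timed-out request."""
    message = 'The request timed out.'


def get_page(fetch):
    """Inner layer: translates a low-level error into a domain error."""
    try:
        return fetch()
    except TimeoutError as e:
        # 'from e' keeps the original exception available as __cause__.
        raise ScraperTimeoutError() from e


def scrape(fetch):
    """Outer layer: catches domain errors and records them in the result."""
    result = {'content': None, 'is_error': False, 'error_message': None}
    try:
        result['content'] = get_page(fetch)
    except ScraperError as e:
        result['is_error'] = True
        result['error_message'] = e.message
    return result
```

Here `fetch` is any zero-argument callable that returns the page content, standing in for the `requests.get` call.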


asked Oct 15 at 22:03

25Postcards

112

add a comment |

The code operates as follows:

The API is called with the desired program (broadcast) date as the data payload

The web scraper is then called to scrape the appropriate webpage for that given date

The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information

To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.

I will include a few sections of code for the web scraper and API only. The full repository can be found here.

You can clone the repo, build the container, and run it to test the program:

git clone https://github.com/25Postcards/rfi_jeff_api

sudo docker build . -t jeff_api:latest

sudo docker run -p 8000:8000 jeff_api

API

from flask_restplus import Namespace, Resource, fields



from core import jeff_scraper

from core import jeff_logger

from core import jeff_validators



api = Namespace('web', description='Operations on the RFI website.')



# This model definition is required so it can be registered to the API docs

transcriptions_model = api.model('Transcription', {

    'program_date': fields.String(required=True),

    'encoding': fields.String,

    'title': fields.String,

    'article': fields.List(fields.String),

    'url': fields.Url('trans_pd_ep'),

    'status_code': fields.String,

    'error_message': fields.String

})



@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')

@api.param('program_date',

           'The program date for the broadcast. Accepted date format DDMMYYYY.')

@api.doc(model=transcriptions_model)

class Transcriptions(Resource):

    """A Transcriptions resource.

    """

    def get(self, program_date):

        """Gets the transcription from the scrapper.



        Args:

            program_date (str): A string representing the program date



        Returns:

            validate_date_errors (dict): A dict of errors raised by the

                validator for the transcriptions schema.

            data (dict): A dict of attributes from the jeff transcriptions object

                containing the program date, title, article, etc. (see schema).

        """

        # Create validator, validate input

        ts = jeff_validators.TranscriptionsSchema()

        validate_date_errors = ts.validate({'program_date': program_date})

        if validate_date_errors:

            return validate_date_errors



        # Create scrapper, scrape page

        jt = jeff_scraper.JeffTranscription(program_date)



        # Serialise JeffTranscription object to serialised (formatted) dict

        # according to Transcriptions Schema

        data, errors = ts.dump(jt)

        return data

Web scraper

import requests

import logging



from bs4 import BeautifulSoup



from core import jeff_errors

from core import jeff_logger



class JeffTranscription(object):

    """Represents a transcription from the rfi jeff website.



    Attributes:

        program_date (str): A string for the program date of the broadcast,

            accepted date format DDMMYYYY.

        title (str): A string for the title of the transcription.

        article (list(str)): A list of strings for each paragraph in the

            transcription article.

        encoding (str): A string defining the encoding of the transcription.

        is_error (bool): A boolean indicating if an error occurred.

        error_message (str): A string for error messages generated whilst requesting

            the webpage or whilst parsing the content.

        status_code (str): A string indicating the http status code for responses

            from the rfi jeff website.

        url (str): The URL for the rfi jeff webpage for the transcription.

    """



    def __init__(self, program_date):

        """Inits JeffTranscription with the program date."""

        self.program_date = program_date

        self._makeURL()



        self.title = None

        self.article = None

        self.encoding = None



        self.is_error = False

        self.error_message = None

        self.status_code = None



        self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')



        try:

            page_response = self._getPageResponse()

            page_content = page_response.content

            self._scrapePage(page_content)



        except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,

                jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:

                self._handleScraperErrors(e)



    def _handleScraperErrors(self, e):

        """Handles errors raised by the methods.



        Sets the is_error and error_message attributes.



        Args:

            e: An error object raised by the class methods

        """



        self.is_error = True

        self.error_message = e.message

        self._jeff_scraper_logger.logger.error(self.error_message)



    def _makeURL(self):

        """Makes the url for the RFI JEFF Website."""

        RFI_JEFF_BASE_URL = 'https://savoirs.rfi.fr/fr/apprendre-enseigner/' 

                            'langue-francaise/journal-en-francais-facile-'

        RFI_JEFF_END_URL = '-20h00-gmt'

        self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL



    def _getPageResponse(self):

        """Gets the response from the webpage.



        Returns:

            A requests.response object.



        Raises:

            ScraperConnectionError: A connection error occurred.

            ScraperTimeoutError: A timeout error occurred.

            ScraperHTTPError: An HTTP Error occurred.

        """

        try:

            HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}

            page_response = requests.get(self.url, headers=HEADERS, timeout=5)

            self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')

            self.status_code = page_response.status_code

            self.encoding = page_response.encoding

            page_response.raise_for_status()



        except requests.exceptions.ConnectionError as e:

            raise jeff_errors.ScraperConnectionError from e



        except requests.exceptions.Timeout as e:

            raise jeff_errors.ScraperTimeoutError from e



        except requests.exceptions.HTTPError as e:

            raise jeff_errors.ScraperHTTPError from e



        return page_response



    def _scrapePage(self, page_content):

        """Parses the html content from the webpage response.



        Uses the Beautiful Soup library to parse the webpage content to extract

        the title of the broadcast and the transcription. The transcription

        is a series of paragraph elements within an article element that has a

        class attribute defined by ARTICLE_CLASS.



        Example Webpage HTML

            view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt

            Article element at line 518

            First paragraph element at line 532



        Args:

            page_content: The content of the webpage as HTML



        Raises:

            ScraperParserError: An error occurred parsing the page content.

        """



        try:

            ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"

            bs = BeautifulSoup(page_content, "html.parser")



            title_tag = bs.find('title')

            title_text = title_tag.get_text()



            # Find all the p elements within the article element that has the

            # ARTICLE_CLASS class attribute. Remove newline characters and

            # unwanted unicode characters from the p element's text fields.

            # Create a list of strings, one list element for each paragraph.

            article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')

            article_p_text = [p_tag.get_text().replace('n', '').replace(u'xa0', u'')

                              for p_tag in article_p_tags]



            self._jeff_scraper_logger.logger.info('Page content parsed')



            self.title = title_text

            self.article = article_p_text



        except Exception as e:

            raise jeff_errors.ScraperParserError() from e

Notes/Questions:

I've stuck to the Google Python style guide where possible.

The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?

The _getPageResponse method contains a try/except to handle HTTP request errors. Is is correct to use a try/except with a function or method to catch exceptions within the function?

The _getPageResponse method is called from within another try/except block within the init method, is it good practice to have try/except within try/except that are within the init methods? The same concern arises with the _scrapePage method.

Should I be using Flask Restplus to generate swagger API specs/docs automatically? Or should I use a separate extension to generate swagger API specs/docs with Flask Restful (if so, what would this swagger extension be)?

edited 21 mins ago

Jamal♦

30.2k11116226

asked Oct 15 at 22:03

25Postcards

112

add a comment |

The code operates as follows:

The API is called with the desired program (broadcast) date as the data payload

The web scraper is then called to scrape the appropriate webpage for that given date

The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information

To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.

I will include a few sections of code for the web scraper and API only. The full repository can be found here.

You can clone the repo, build the container, and run it to test the program:

git clone https://github.com/25Postcards/rfi_jeff_api

sudo docker build . -t jeff_api:latest

sudo docker run -p 8000:8000 jeff_api

API

from flask_restplus import Namespace, Resource, fields



from core import jeff_scraper

from core import jeff_logger

from core import jeff_validators



api = Namespace('web', description='Operations on the RFI website.')



# This model definition is required so it can be registered to the API docs

transcriptions_model = api.model('Transcription', {

    'program_date': fields.String(required=True),

    'encoding': fields.String,

    'title': fields.String,

    'article': fields.List(fields.String),

    'url': fields.Url('trans_pd_ep'),

    'status_code': fields.String,

    'error_message': fields.String

})



@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')

@api.param('program_date',

           'The program date for the broadcast. Accepted date format DDMMYYYY.')

@api.doc(model=transcriptions_model)

class Transcriptions(Resource):

    """A Transcriptions resource.

    """

    def get(self, program_date):

        """Gets the transcription from the scrapper.



        Args:

            program_date (str): A string representing the program date



        Returns:

            validate_date_errors (dict): A dict of errors raised by the

                validator for the transcriptions schema.

            data (dict): A dict of attributes from the jeff transcriptions object

                containing the program date, title, article, etc. (see schema).

        """

        # Create validator, validate input

        ts = jeff_validators.TranscriptionsSchema()

        validate_date_errors = ts.validate({'program_date': program_date})

        if validate_date_errors:

            return validate_date_errors



        # Create scrapper, scrape page

        jt = jeff_scraper.JeffTranscription(program_date)



        # Serialise JeffTranscription object to serialised (formatted) dict

        # according to Transcriptions Schema

        data, errors = ts.dump(jt)

        return data

Web scraper

import requests

import logging



from bs4 import BeautifulSoup



from core import jeff_errors

from core import jeff_logger



class JeffTranscription(object):

    """Represents a transcription from the rfi jeff website.



    Attributes:

        program_date (str): A string for the program date of the broadcast,

            accepted date format DDMMYYYY.

        title (str): A string for the title of the transcription.

        article (list(str)): A list of strings for each paragraph in the

            transcription article.

        encoding (str): A string defining the encoding of the transcription.

        is_error (bool): A boolean indicating if an error occurred.

        error_message (str): A string for error messages generated whilst requesting

            the webpage or whilst parsing the content.

        status_code (str): A string indicating the http status code for responses

            from the rfi jeff website.

        url (str): The URL for the rfi jeff webpage for the transcription.

    """



    def __init__(self, program_date):

        """Inits JeffTranscription with the program date."""

        self.program_date = program_date

        self._makeURL()



        self.title = None

        self.article = None

        self.encoding = None



        self.is_error = False

        self.error_message = None

        self.status_code = None



        self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')



        try:

            page_response = self._getPageResponse()

            page_content = page_response.content

            self._scrapePage(page_content)



        except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,

                jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:

                self._handleScraperErrors(e)



    def _handleScraperErrors(self, e):

        """Handles errors raised by the methods.



        Sets the is_error and error_message attributes.



        Args:

            e: An error object raised by the class methods

        """



        self.is_error = True

        self.error_message = e.message

        self._jeff_scraper_logger.logger.error(self.error_message)



    def _makeURL(self):

        """Makes the url for the RFI JEFF Website."""

        RFI_JEFF_BASE_URL = 'https://savoirs.rfi.fr/fr/apprendre-enseigner/' 

                            'langue-francaise/journal-en-francais-facile-'

        RFI_JEFF_END_URL = '-20h00-gmt'

        self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL



    def _getPageResponse(self):

        """Gets the response from the webpage.



        Returns:

            A requests.response object.



        Raises:

            ScraperConnectionError: A connection error occurred.

            ScraperTimeoutError: A timeout error occurred.

            ScraperHTTPError: An HTTP Error occurred.

        """

        try:

            HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}

            page_response = requests.get(self.url, headers=HEADERS, timeout=5)

            self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')

            self.status_code = page_response.status_code

            self.encoding = page_response.encoding

            page_response.raise_for_status()



        except requests.exceptions.ConnectionError as e:

            raise jeff_errors.ScraperConnectionError from e



        except requests.exceptions.Timeout as e:

            raise jeff_errors.ScraperTimeoutError from e



        except requests.exceptions.HTTPError as e:

            raise jeff_errors.ScraperHTTPError from e



        return page_response



    def _scrapePage(self, page_content):

        """Parses the html content from the webpage response.



        Uses the Beautiful Soup library to parse the webpage content to extract

        the title of the broadcast and the transcription. The transcription

        is a series of paragraph elements within an article element that has a

        class attribute defined by ARTICLE_CLASS.



        Example Webpage HTML

            view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt

            Article element at line 518

            First paragraph element at line 532



        Args:

            page_content: The content of the webpage as HTML



        Raises:

            ScraperParserError: An error occurred parsing the page content.

        """



        try:

            ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"

            bs = BeautifulSoup(page_content, "html.parser")



            title_tag = bs.find('title')

            title_text = title_tag.get_text()



            # Find all the p elements within the article element that has the

            # ARTICLE_CLASS class attribute. Remove newline characters and

            # unwanted unicode characters from the p element's text fields.

            # Create a list of strings, one list element for each paragraph.

            article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')

            article_p_text = [p_tag.get_text().replace('n', '').replace(u'xa0', u'')

                              for p_tag in article_p_tags]



            self._jeff_scraper_logger.logger.info('Page content parsed')



            self.title = title_text

            self.article = article_p_text



        except Exception as e:

            raise jeff_errors.ScraperParserError() from e

Notes/Questions:

I've stuck to the Google Python style guide where possible.

The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?

The _getPageResponse method contains a try/except to handle HTTP request errors. Is is correct to use a try/except with a function or method to catch exceptions within the function?

The _getPageResponse method is called from within another try/except block within the init method, is it good practice to have try/except within try/except that are within the init methods? The same concern arises with the _scrapePage method.

Should I be using Flask Restplus to generate swagger API specs/docs automatically? Or should I use a separate extension to generate swagger API specs/docs with Flask Restful (if so, what would this swagger extension be)?

edited 21 mins ago

Jamal♦

30.2k11116226

asked Oct 15 at 22:03

25Postcards

112

The code operates as follows:

The API is called with the desired program (broadcast) date as the data payload

The web scraper is then called to scrape the appropriate webpage for that given date

The response of the API call is a transcription of the broadcast in the form of a list of sentences/paragraphs as well as other information

To set expectations, I am a beginner programmer and this is my first independent project. I am seeking a project review and feedback on what I've created so far.

I will include a few sections of code for the web scraper and API only. The full repository can be found here.

You can clone the repo, build the container, and run it to test the program:

git clone https://github.com/25Postcards/rfi_jeff_api

sudo docker build . -t jeff_api:latest

sudo docker run -p 8000:8000 jeff_api

API

from flask_restplus import Namespace, Resource, fields



from core import jeff_scraper

from core import jeff_logger

from core import jeff_validators



api = Namespace('web', description='Operations on the RFI website.')



# This model definition is required so it can be registered to the API docs

transcriptions_model = api.model('Transcription', {

    'program_date': fields.String(required=True),

    'encoding': fields.String,

    'title': fields.String,

    'article': fields.List(fields.String),

    'url': fields.Url('trans_pd_ep'),

    'status_code': fields.String,

    'error_message': fields.String

})



@api.route('/transcriptions/<program_date>', endpoint='trans_pd_ep')

@api.param('program_date',

           'The program date for the broadcast. Accepted date format DDMMYYYY.')

@api.doc(model=transcriptions_model)

class Transcriptions(Resource):

    """A Transcriptions resource.

    """

    def get(self, program_date):

        """Gets the transcription from the scrapper.



        Args:

            program_date (str): A string representing the program date



        Returns:

            validate_date_errors (dict): A dict of errors raised by the

                validator for the transcriptions schema.

            data (dict): A dict of attributes from the jeff transcriptions object

                containing the program date, title, article, etc. (see schema).

        """

        # Create validator, validate input

        ts = jeff_validators.TranscriptionsSchema()

        validate_date_errors = ts.validate({'program_date': program_date})

        if validate_date_errors:

            return validate_date_errors



        # Create scrapper, scrape page

        jt = jeff_scraper.JeffTranscription(program_date)



        # Serialise JeffTranscription object to serialised (formatted) dict

        # according to Transcriptions Schema

        data, errors = ts.dump(jt)

        return data

Web scraper

import requests

import logging



from bs4 import BeautifulSoup



from core import jeff_errors

from core import jeff_logger



class JeffTranscription(object):

    """Represents a transcription from the rfi jeff website.



    Attributes:

        program_date (str): A string for the program date of the broadcast,

            accepted date format DDMMYYYY.

        title (str): A string for the title of the transcription.

        article (list(str)): A list of strings for each paragraph in the

            transcription article.

        encoding (str): A string defining the encoding of the transcription.

        is_error (bool): A boolean indicating if an error occurred.

        error_message (str): A string for error messages generated whilst requesting

            the webpage or whilst parsing the content.

        status_code (str): A string indicating the http status code for responses

            from the rfi jeff website.

        url (str): The URL for the rfi jeff webpage for the transcription.

    """



    def __init__(self, program_date):

        """Inits JeffTranscription with the program date."""

        self.program_date = program_date

        self._makeURL()



        self.title = None

        self.article = None

        self.encoding = None



        self.is_error = False

        self.error_message = None

        self.status_code = None



        self._jeff_scraper_logger = jeff_logger.JeffLogger('jeff_scraper_logger')



        try:

            page_response = self._getPageResponse()

            page_content = page_response.content

            self._scrapePage(page_content)



        except (jeff_errors.ScraperConnectionError, jeff_errors.ScraperTimeoutError,

                jeff_errors.ScraperHTTPError, jeff_errors.ScraperParserError) as e:

            self._handleScraperErrors(e)



    def _handleScraperErrors(self, e):

        """Handles errors raised by the methods.



        Sets the is_error and error_message attributes.



        Args:

            e: An error object raised by the class methods

        """



        self.is_error = True

        self.error_message = e.message

        self._jeff_scraper_logger.logger.error(self.error_message)



    def _makeURL(self):

        """Makes the url for the RFI JEFF Website."""

        RFI_JEFF_BASE_URL = ('https://savoirs.rfi.fr/fr/apprendre-enseigner/'

                             'langue-francaise/journal-en-francais-facile-')

        RFI_JEFF_END_URL = '-20h00-gmt'

        self.url = RFI_JEFF_BASE_URL + self.program_date + RFI_JEFF_END_URL



    def _getPageResponse(self):

        """Gets the response from the webpage.



        Returns:

            A requests.response object.



        Raises:

            ScraperConnectionError: A connection error occurred.

            ScraperTimeoutError: A timeout error occurred.

            ScraperHTTPError: An HTTP Error occurred.

        """

        try:

            HEADERS = {'User-Agent': 'Chrome/61.0.3163.91'}

            page_response = requests.get(self.url, headers=HEADERS, timeout=5)

            self._jeff_scraper_logger.logger.info('Request sent to JEFF URL')

            self.status_code = page_response.status_code

            self.encoding = page_response.encoding

            page_response.raise_for_status()



        except requests.exceptions.ConnectionError as e:

            raise jeff_errors.ScraperConnectionError from e



        except requests.exceptions.Timeout as e:

            raise jeff_errors.ScraperTimeoutError from e



        except requests.exceptions.HTTPError as e:

            raise jeff_errors.ScraperHTTPError from e



        return page_response



    def _scrapePage(self, page_content):

        """Parses the html content from the webpage response.



        Uses the Beautiful Soup library to parse the webpage content to extract

        the title of the broadcast and the transcription. The transcription

        is a series of paragraph elements within an article element that has a

        class attribute defined by ARTICLE_CLASS.



        Example Webpage HTML

            view-source:https://savoirs.rfi.fr/fr/apprendre-enseigner/langue-francaise/journal-en-francais-facile-23072018-20h00-gmt

            Article element at line 518

            First paragraph element at line 532



        Args:

            page_content: The content of the webpage as HTML



        Raises:

            ScraperParserError: An error occurred parsing the page content.

        """



        try:

            ARTICLE_CLASS = "node node-edition node-edition-full term-2707 clearfix"

            bs = BeautifulSoup(page_content, "html.parser")



            title_tag = bs.find('title')

            title_text = title_tag.get_text()



            # Find all the p elements within the article element that has the

            # ARTICLE_CLASS class attribute. Remove newline characters and

            # unwanted unicode characters from the p element's text fields.

            # Create a list of strings, one list element for each paragraph.

            article_p_tags = bs.find('article', class_=ARTICLE_CLASS).find_all('p')

            article_p_text = [p_tag.get_text().replace('\n', '').replace(u'\xa0', u'')

                              for p_tag in article_p_tags]



            self._jeff_scraper_logger.logger.info('Page content parsed')



            self.title = title_text

            self.article = article_p_text



        except Exception as e:

            raise jeff_errors.ScraperParserError() from e
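
The paragraph cleanup in _scrapePage is meant to strip newline characters and non-breaking spaces (U+00A0) from each paragraph's text. A small standalone sketch of that step, without Beautiful Soup, is below. The helper name clean_paragraph and the sample string are made up for illustration.

```python
def clean_paragraph(text: str) -> str:
    """Remove newline and non-breaking-space characters from a paragraph.

    Hypothetical helper mirroring the chained replace() calls in _scrapePage.
    """
    # Both characters are deleted outright, not replaced with a space,
    # so words separated only by them end up joined together.
    return text.replace('\n', '').replace('\xa0', '')


sample = "Bonjour\xa0!\nVoici le journal."
cleaned = clean_paragraph(sample)
print(cleaned)
```

Because the characters are deleted and not replaced with a space, text on either side of a newline is joined directly, which may or may not be what is wanted for the transcription.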

Notes/Questions:

I've stuck to the Google Python style guide where possible.

The scraping is performed when the scraper is instantiated. Is this good practice or should I create a method to initiate or invoke the scraping code?

The _getPageResponse method contains a try/except to handle HTTP request errors. Is it correct to use a try/except inside a function or method to catch exceptions raised within it?

The _getPageResponse method is called from within another try/except block in the __init__ method. Is it good practice to nest try/except blocks like this inside __init__? The same concern arises with the _scrapePage method.

Should I be using Flask Restplus to generate swagger API specs/docs automatically? Or should I use a separate extension to generate swagger API specs/docs with Flask Restful (if so, what would this swagger extension be)?
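
To make the try/except questions concrete, the pattern in question reduces to roughly the following standalone sketch. The names fetch and scrape and the local ScraperTimeoutError class are stand-ins for illustration (the real definitions live in jeff_errors, which is not shown), and the failure is simulated so that no network access is needed.

```python
class ScraperTimeoutError(Exception):
    """Stand-in for jeff_errors.ScraperTimeoutError (definition assumed)."""


def fetch(url):
    """Inner layer: translate a low-level error into a domain error."""
    try:
        # Simulated failure in place of a real requests.get() call.
        raise TimeoutError(f"no response from {url}")
    except TimeoutError as e:
        # 'from e' keeps the original exception as __cause__.
        raise ScraperTimeoutError("request timed out") from e


def scrape(url):
    """Outer layer: catch the domain error and record it, without raising."""
    result = {"is_error": False, "error_message": None}
    try:
        fetch(url)
    except ScraperTimeoutError as e:
        result["is_error"] = True
        result["error_message"] = str(e)
    return result
```

Here the inner function only converts exception types, and the outer one decides what to do with the failure, which is the same division of labour as between _getPageResponse and __init__ above.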

python web-scraping api beautifulsoup flask

edited 21 mins ago by Jamal♦

asked Oct 15 at 22:03 by 25Postcards

