NodeJs Crawler - How can I make it more scalable and maintainable
























$begingroup$


Here is my attempt at a recipe crawler written in Node.js with cheerio. I built it with a future project in mind. The repository is at https://github.com/Just4lol/CookCrawler



index.js shows how to use it:



    const cookCrawler = require('./cookCrawler.js')

    cookCrawler.getRecipeData(recipeUrl).then(data => {
        console.log(data)
    })


I think this part is OK (though I will still take your feedback :) ). My problem is with the structure behind it. For each website I want to parse data from, I need to write a new parser script. To reduce code duplication and give the project some structure, I created a RecipeParser class that each site-specific parser extends.



    const cheerio = require('cheerio')
    // requestP (a promise-returning HTTP request function) and whiteSpaceRemReg
    // (a RegExp) are defined elsewhere in this module and are not shown here.

    class RecipeParser {
        async loadHtml(url) {
            this.recipeUrl = url;

            try {
                const recipeHtml = await requestP(url);
                // Load the virtual DOM
                this.$ = cheerio.load(recipeHtml);

                return this;
            }
            catch (err) {
                console.log(err);
            }
        }

        async parseHtml(url) {
            try {
                await this.loadHtml(url)
                return this.parse()
            }
            catch(err) {
                console.log(err);
            }
        }

        getTitle(selector) {
            return this.whiteSpaceRemover(this.$(selector).text())
        }

        // The next three methods must be overridden by each site-specific parser
        getRecipeInfo(selector) {
            throw new Error('You have to implement the method getRecipeInfo!');
        }

        getIngredients(selector) {
            throw new Error('You have to implement the method getIngredients!');
        }

        getSteps(selector) {
            throw new Error('You have to implement the method getSteps!');
        }

        getRecipeImgUrl(selector) {
            return this.$(selector).attr('href')
        }

        /**
         * Returns the parsed recipe object
         */
        parse() {
            return {
                recipeUrl: this.recipeUrl,
                title: this.getTitle(),
                recipeInfo: this.getRecipeInfo(),
                ingredients: this.getIngredients(),
                steps: this.getSteps(),
                recipeImgUrl: this.getRecipeImgUrl()
            }
        }

        // Collect the text of every element matching the selector
        getTxtArrayFromElements(selector) {
            const array = []
            this.$(selector).each((i, element) => {
                array.push(this.$(element).text())
            })

            return array
        }

        whiteSpaceRemover(string) {
            return string.replace(whiteSpaceRemReg, '')
        }
    }

    module.exports = RecipeParser
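As a rough sketch of one direction for the fixed property list in parse() above: the base class could hold a list of field names and look up the matching getter for each one, so that adding a property means adding one name and one method per site parser. FieldDrivenParser, its fields list, and ExampleSiteParser are hypothetical names that do not exist in the project, and the subclass returns fixed values instead of querying a page so the example runs without cheerio.

```javascript
// Sketch only: FieldDrivenParser and its `fields` list are hypothetical names.
class FieldDrivenParser {
  // Properties every site parser must produce; add a name here to add a property.
  static get fields() {
    return ['title', 'ingredients', 'steps']
  }

  // Map a field name such as 'title' to its getter name, 'getTitle'.
  static getterName(field) {
    return `get${field[0].toUpperCase()}${field.slice(1)}`
  }

  parse() {
    const result = {}
    for (const field of this.constructor.fields) {
      const name = this.constructor.getterName(field)
      // Fail with a clear message if a subclass forgot to implement a getter.
      if (typeof this[name] !== 'function') {
        throw new Error(`${this.constructor.name} must implement ${name}()`)
      }
      result[field] = this[name]()
    }
    return result
  }
}

// A stand-in subclass that returns fixed values instead of querying a page.
class ExampleSiteParser extends FieldDrivenParser {
  getTitle() { return 'Pancakes' }
  getIngredients() { return ['flour', 'milk', 'eggs'] }
  getSteps() { return ['Mix', 'Cook'] }
}

const parsed = new ExampleSiteParser().parse()
console.log(parsed)
```

With this shape the "You have to implement the method" stubs are no longer needed in the base class, because the missing-getter check covers every field in the list.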


It does the job, but it does not scale well if I want to add more properties. In general I'm not really happy with it, and I'm sure there is a better way. I liked the idea of mixins, but (I may be wrong) I don't think they would help in my case, because each website is too unique for logic written for one to apply to another. For example, here is the function from ricardoParse.js that extracts the ingredients from one of their recipes:



    getIngredients() {
        // If the form contains h3 elements, the page holds two recipes
        if (this.$('#formIngredients > h3').length) {
            let obj = {}
            // For each recipe title, link the array of ingredients to it
            this.$('#formIngredients > h3').each((i, element) => {
                obj[this.$(element).text()] = (() => {
                    const ingredients = []
                    this.$(this.$('#formIngredients > ul ')[i]).find('li').each((j, ulElement) => {
                        ingredients.push(this.whiteSpaceRemover(this.$(ulElement).text()))
                    })

                    return ingredients
                })()
            })

            return obj
        }
        else return this.getTxtArrayFromElements('#formIngredients ul > li > label > span')
    }
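The step in the function above that links each recipe title to its ingredient list can be separated from the DOM queries, which makes it testable without loading a page. The following is a minimal sketch of that pairing step only. groupByHeading is a hypothetical helper, and it assumes the headings and the per-list ingredient arrays have already been collected from the page in document order.

```javascript
// Sketch only: groupByHeading is a hypothetical helper, not part of the project.
// headings: text of each h3, in document order
// lists:    one array of ingredient strings per ul, in the same order
function groupByHeading(headings, lists) {
  const grouped = {}
  headings.forEach((heading, i) => {
    // Fall back to an empty list if a heading has no matching ul.
    grouped[heading] = lists[i] || []
  })
  return grouped
}

const grouped = groupByHeading(
  ['Cake', 'Icing'],
  [['flour', 'sugar'], ['butter', 'icing sugar']]
)
console.log(grouped)
```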


The last thing is the "factory" I'm using, which would become a real pain with more than 10 parsers in it:




    class CookCrawler {
        static getRecipeData(url) {
            const domain = url.match(domainMatchReg).toString()
            switch(domain) {
                case 'https://www.ricardocuisine.com':
                    const ricardoParser = new RicardoParser()

                    return ricardoParser.parseHtml(url)
                case 'https://www.troisfoisparjour.com':
                    const troisfoisparjourParse = new TroisfoisparjourParser()

                    return troisfoisparjourParse.parseHtml(url)
                default:
                    console.warn('No parser exists for this domain, or the URL is wrong.')
                    break
            }
        }
    }
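One possible alternative to the growing switch is a lookup table keyed by hostname, sketched below under a few assumptions. The two parser classes here are stubs standing in for the real ones, parsersByHostname is a hypothetical name, and the recipe path in the usage call is made up. Unlike the switch version, this sketch returns a rejected promise for unsupported or malformed URLs, so a caller's .then() never runs on undefined.

```javascript
// Sketch only: these stub classes stand in for the real site parsers.
class RicardoParser {
  parseHtml(url) { return Promise.resolve({ source: 'ricardo', recipeUrl: url }) }
}
class TroisfoisparjourParser {
  parseHtml(url) { return Promise.resolve({ source: 'troisfoisparjour', recipeUrl: url }) }
}

// One entry per supported site; supporting a new site means adding one line.
const parsersByHostname = new Map([
  ['www.ricardocuisine.com', RicardoParser],
  ['www.troisfoisparjour.com', TroisfoisparjourParser]
])

function getRecipeData(url) {
  let hostname
  try {
    // The WHATWG URL class is a global in current Node.js versions.
    hostname = new URL(url).hostname
  } catch (err) {
    return Promise.reject(new Error(`Invalid URL: ${url}`))
  }

  const Parser = parsersByHostname.get(hostname)
  if (!Parser) {
    return Promise.reject(new Error(`No parser exists for ${hostname}`))
  }
  return new Parser().parseHtml(url)
}

// Usage with a made-up recipe path.
getRecipeData('https://www.ricardocuisine.com/some-recipe')
  .then(data => console.log(data))
  .catch(err => console.error(err.message))
```

Keying on the hostname rather than on a regex match of the full origin also means the lookup does not depend on the URL scheme.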


So, as the title says: how can I make it more scalable and maintainable, and have I done a good job? ;) I hope this is a "good question", and I look forward to any feedback and comments on my project.



J.R









$endgroup$

















Tags: javascript, node.js, web-scraping





asked 9 mins ago by Just4lol
