Node.js crawler - how can I make it more scalable and maintainable?
Here is my attempt at a crawler made in Node.js with cheerio; I made it with the idea of using it in a future project I want to make. Here is the Git link: https://github.com/Just4lol/CookCrawler
If you look at index.js you will see how to use it:
const cookCrawler = require('./cookCrawler.js')

cookCrawler.getRecipeData(recipeUrl).then(data => {
  console.log(data)
})
I think this part is OK (I'll still take your feedback :)); my problem is with the structure behind it. For each website that I want to parse data from, I need to create a new parser script, so to save some code duplication and add structure to the project I created the RecipeParser class, which they all extend.
class RecipeParser {
  async loadHtml(url) {
    this.recipeUrl = url;
    try {
      const recipeHtml = await requestP(url);
      // Load the virtual DOM
      this.$ = cheerio.load(recipeHtml);
      return this;
    }
    catch (err) {
      console.log(err);
    }
  }

  async parseHtml(url) {
    try {
      await this.loadHtml(url)
      return this.parse()
    }
    catch (err) {
      console.log(err);
    }
  }

  getTitle(selector) {
    return this.whiteSpaceRemover(this.$(selector).text())
  }

  getRecipeInfo(selector) {
    throw new Error('You have to implement the method getRecipeInfo!');
  }

  getIngredients(selector) {
    throw new Error('You have to implement the method getIngredients!');
  }

  getSteps(selector) {
    throw new Error('You have to implement the method getSteps!');
  }

  getRecipeImgUrl(selector) {
    return this.$(selector).attr('href')
  }

  /**
   * Return the obj
   */
  parse() {
    return {
      recipeUrl: this.recipeUrl,
      title: this.getTitle(),
      recipeInfo: this.getRecipeInfo(),
      ingredients: this.getIngredients(),
      steps: this.getSteps(),
      recipeImgUrl: this.getRecipeImgUrl()
    }
  }
  getTxtArrayFromElements(selector) {
    const array = []
    this.$(selector).each((i, element) => {
      array.push(this.$(element).text())
    })
    return array
  }
  whiteSpaceRemover(string) {
    return string.replace(whiteSpaceRemReg, '')
  }
}
module.exports = RecipeParser
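To give an idea of what a concrete parser looks like, here is a simplified sketch of a subclass (the selectors and file path here are placeholders, not the real ones from my parsers):
class ExampleParser extends require('./recipeParser.js') {
  // Simplified sketch of a site parser; every selector below is a placeholder
  getTitle() {
    return super.getTitle('.recipe-title')
  }
  getRecipeInfo() {
    return this.getTxtArrayFromElements('.recipe-info > li')
  }
  getIngredients() {
    return this.getTxtArrayFromElements('.ingredients li')
  }
  getSteps() {
    return this.getTxtArrayFromElements('.steps li')
  }
  getRecipeImgUrl() {
    return super.getRecipeImgUrl('.recipe-img > a')
  }
}

module.exports = ExampleParser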
It does its job, but it's not very scalable if I want to add additional properties, and in general I'm not really happy with it; I'm sure there is a better way of doing it. I liked the idea of mixins but, I may be wrong, I don't think they would help in my case, because each website is too unique and the logic for one can't really apply to another. Here is the function from ricardoParse.js that extracts the ingredients from one of their recipes:
getIngredients() {
  // If the form has h3 elements in it, the page contains more than one recipe
  if (this.$('#formIngredients > h3').length) {
    let obj = {}
    // For each recipe title, link the array of its ingredients to it
    this.$('#formIngredients > h3').each((i, element) => {
      obj[this.$(element).text()] = (() => {
        const ingredients = []
        this.$(this.$('#formIngredients > ul')[i]).find('li').each((j, ulElement) => {
          ingredients.push(this.whiteSpaceRemover(this.$(ulElement).text()))
        })
        return ingredients
      })()
    })
    return obj
  }
  else return this.getTxtArrayFromElements('#formIngredients ul > li > label > span')
}
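Side note: I suspect the inner IIFE could be flattened with cheerio's .map()/.get(), which collects each callback's return value into a plain array. Untested sketch:
// Untested sketch: .map(...).get() returns the li texts as a plain array
const ingredients = this.$(this.$('#formIngredients > ul')[i])
  .find('li')
  .map((j, ulElement) => this.whiteSpaceRemover(this.$(ulElement).text()))
  .get()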
The last thing is the "factory" I'm using; it would be a real pain if I had more than 10 parsers in it:
class CookCrawler {
  static getRecipeData(url) {
    const domain = url.match(domainMatchReg).toString()
    switch (domain) {
      case 'https://www.ricardocuisine.com':
        const ricardoParser = new RicardoParser()
        return ricardoParser.parseHtml(url)
      case 'https://www.troisfoisparjour.com':
        const troisfoisparjourParse = new TroisfoisparjourParser()
        return troisfoisparjourParse.parseHtml(url)
      default:
        console.warn('No parser exists for this domain, or the URL is wrong.')
        break
    }
  }
}
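One idea I had is to replace the switch with a lookup object from domain to parser class, so adding a site is just one new entry (rough, untested sketch; the module paths are made up):
// Rough sketch: one entry per supported domain (module paths are made up)
const parsers = {
  'https://www.ricardocuisine.com': require('./ricardoParser.js'),
  'https://www.troisfoisparjour.com': require('./troisfoisparjourParser.js')
}

class CookCrawler {
  static getRecipeData(url) {
    const domain = url.match(domainMatchReg).toString()
    const Parser = parsers[domain]
    if (!Parser) {
      console.warn('No parser exists for this domain, or the URL is wrong.')
      return
    }
    return new Parser().parseHtml(url)
  }
}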
So, like the title says: how can I make it more scalable and maintainable, and have I done a good job? ;) I hope this is a "good question", and I look forward to any feedback and comments on my project.
J.R
javascript node.js web-scraping