Some rows lost when creating Whoosh index in Python

I'm trying to load data into the Whoosh search engine. There are ~7,000,000 rows to insert. First I fetch all the rows from a PostgreSQL database with psycopg2, then I insert them into a Whoosh index with writer.add_document(some_data_here). My writer object is created as follows:



writer = index.writer(limitmb=1024, procs=5, multisegment=True)
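
For context, the overall flow looks roughly like this; the schema, connection string, and table/column names below are illustrative guesses, not the actual code:

import os

import psycopg2
from whoosh import index
from whoosh.fields import ID, TEXT, Schema

# Hypothetical schema - the real one presumably mirrors the Postgres table.
schema = Schema(id=ID(stored=True, unique=True), body=TEXT(stored=True))
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

conn = psycopg2.connect("dbname=mydb")        # placeholder connection string
cur = conn.cursor()
cur.execute("SELECT id, body FROM my_table")  # placeholder table/columns

writer = ix.writer(limitmb=1024, procs=5, multisegment=True)
for pk, body in cur:
    writer.add_document(id=str(pk), body=body)
writer.commit()  # nothing is searchable until the writer commits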


The problem is that executing index.searcher().documents() (which is supposed to return all the documents in the index) yields a significantly smaller number of rows - around 5,000,000. I can confirm this with another query: simply searching for a Term that matches every record returns the same count (around 5 million).
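
One way to cross-check the count from the index side (the directory name is a placeholder; doc_count() and doc_count_all() are standard Whoosh searcher methods):

from whoosh import index

ix = index.open_dir("indexdir")  # placeholder directory name
with ix.searcher() as searcher:
    print(searcher.doc_count())      # documents visible to searches
    print(searcher.doc_count_all())  # including deleted documents
    print(sum(1 for _ in searcher.documents()))  # brute-force count of stored docs

A large gap between doc_count_all() and doc_count() would point at documents being deleted or overwritten rather than never written at all.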



I thought this might be a Python concurrency or memory issue, so I tried loading in batches - I divided the data into equal blocks of 500,000 records - but had no luck and still ended up with a lower count. I also tried playing with the writer's parameters, again without success.
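
For reference, a batched variant that also counts exactly how many rows are actually handed to Whoosh, so the writer side can be ruled in or out. Names are again placeholders; the named (server-side) psycopg2 cursor avoids pulling all 7M rows into memory at once:

import psycopg2
from whoosh import index

BATCH = 500_000  # batch size from the question

conn = psycopg2.connect("dbname=mydb")        # placeholder connection string
cur = conn.cursor(name="whoosh_feed")         # named => server-side cursor
cur.itersize = 10_000                         # rows per network round trip
cur.execute("SELECT id, body FROM my_table")  # placeholder table/columns

ix = index.open_dir("indexdir")
writer = ix.writer(limitmb=1024, procs=5, multisegment=True)
written = 0
for pk, body in cur:
    writer.add_document(id=str(pk), body=body)
    written += 1
    if written % BATCH == 0:
        writer.commit()                       # flush this block to disk
        writer = ix.writer(limitmb=1024, procs=5, multisegment=True)
writer.commit()                               # commit the final partial block
print("rows sent to Whoosh:", written)        # compare with SELECT count(*) in Postgres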



I discovered the issue when searching for a record that I knew had to exist - it didn't. I'm running a server with 16 GB RAM and 6 CPUs, so resources shouldn't be an issue.

python search-engine whoosh

asked Nov 25 '18 at 22:28 by adamczi, edited Nov 25 '18 at 22:31

  • Not a direct answer to your Q, but if you've got 7 million rows in postgres, why are you using whoosh at all? You'd be better off applying an FTI (full-text index) in postgres than resorting to whoosh... I don't think it's really meant to be used at that scale. (See the sketch after these comments.)

    – Jon Clements
    Nov 25 '18 at 22:31

  • Hi, thanks for your opinion - I'm dealing with existing software that needed minor updates, and I encountered this "bug". About the amount - I've seen others index even more without reporting issues, so I believe there is a way.

    – adamczi
    Nov 25 '18 at 22:42

  • Yeah sure... I just personally wouldn't use Whoosh for this if you've already got a perfectly good DB backend that can do it (and not in Python). You have forced a write to the index, I take it, and checked that you are indeed requesting the entire 7m to be written to it?

    – Jon Clements
    Nov 25 '18 at 22:46

  • That is correct, I'm taking the entire postgres table and inserting it into Whoosh, at once or in batches.

    – adamczi
    Nov 25 '18 at 22:48

  • Umm okies... I've not got any ideas I'm afraid... never had what you're describing myself... good luck in solving it.

    – Jon Clements
    Nov 25 '18 at 22:52
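
As a footnote to the full-text-index suggestion in the first comment, a minimal sketch of what that approach looks like from Python; the connection string, table, and column names are hypothetical:

import psycopg2

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
cur = conn.cursor()

# One-time setup: a GIN index over the tsvector keeps searches fast at 7M rows.
cur.execute("""
    CREATE INDEX IF NOT EXISTS my_table_body_fts
    ON my_table USING GIN (to_tsvector('english', body))
""")
conn.commit()

# Query: @@ matches a tsvector against a tsquery.
cur.execute("""
    SELECT id, body FROM my_table
    WHERE to_tsvector('english', body) @@ plainto_tsquery('english', %s)
""", ("search terms",))
print(cur.fetchall())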















