Some rows lost when creating Whoosh index in Python
I'm trying to load data into the Whoosh search engine. There are ~7,000,000 rows to insert. First I fetch all the rows from a PostgreSQL database with psycopg2, then I insert them into the Whoosh index with writer.add_document(some_data_here). My writer object is created as follows:
writer = index.writer(limitmb=1024, procs=5, multisegment=True)
The problem is that executing index.searcher().documents() (which is supposed to return all the documents in the index) returns significantly fewer rows, around 5,000,000. I can confirm this with another query: simply searching for a Term that matches every record gives the same result (around 5 million).
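For reference, a minimal sketch of this kind of count check, assuming the index lives in a directory called indexdir (the path is illustrative); Whoosh's doc_count() and a match-all Every query should agree:

```python
from whoosh import index
from whoosh.query import Every

ix = index.open_dir("indexdir")  # illustrative path

# Number of undeleted documents across all segments
print("doc_count:", ix.doc_count())

with ix.searcher() as searcher:
    # A match-all query; its hit count should equal doc_count()
    results = searcher.search(Every(), limit=None)
    print("match-all hits:", len(results))
```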
I thought this might be some Python concurrency or memory issue, so I tried loading in batches: I divided the data into equal blocks of 500,000 records, but with no luck, still getting a lower count. I also tried playing with the writer's parameters, again without success.
I discovered the issue when searching for a record that I knew for certain had to exist, and it didn't. I'm running a server with 16 GB RAM and 6 CPUs, so resources shouldn't be an issue.
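For context, a minimal sketch of the loading pattern described above, assuming a simple two-field schema and a server-side cursor (the connection string, table, and field names are illustrative, not the actual application's); the key point is that nothing becomes searchable until writer.commit() returns:

```python
import psycopg2
from whoosh import index
from whoosh.fields import Schema, ID, TEXT

# Illustrative schema; the real index has whatever fields the app defines
schema = Schema(id=ID(stored=True, unique=True), body=TEXT(stored=True))
ix = index.create_in("indexdir", schema)

writer = ix.writer(limitmb=1024, procs=5, multisegment=True)

conn = psycopg2.connect("dbname=mydb")  # illustrative connection string
cur = conn.cursor(name="whoosh_load")   # named cursor streams rows server-side
cur.execute("SELECT id, body FROM records")

for row_id, body in cur:
    writer.add_document(id=str(row_id), body=body)

writer.commit()  # documents are not visible to searchers before this point
conn.close()
```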
python search-engine whoosh
Not a direct answer to your question, but if you've got 7 million rows in Postgres, why are you using Whoosh at all? You'd be better off applying a full-text index in Postgres than resorting to Whoosh... I don't think it's really meant to be used at that scale.
– Jon Clements♦
Nov 25 '18 at 22:31
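To illustrate what this comment suggests, a hedged sketch of a PostgreSQL full-text index driven from Python; the table and column names (records, body) are assumptions, not the asker's actual schema:

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # illustrative connection string
cur = conn.cursor()

# A GIN index over a tsvector expression lets Postgres answer
# full-text queries directly, with no separate search engine.
cur.execute("""
    CREATE INDEX IF NOT EXISTS records_body_fts
    ON records USING GIN (to_tsvector('english', body))
""")
conn.commit()

# Queries must use the same to_tsvector expression to hit the index
cur.execute("""
    SELECT id FROM records
    WHERE to_tsvector('english', body) @@ to_tsquery('english', %s)
""", ("example",))
print(cur.fetchall())
conn.close()
```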
Hi, thanks for your opinion. I'm dealing with existing software that needed minor updates and encountered this "bug". About the amount: I've seen others index even more without reporting issues, so I believe there is a way.
– adamczi
Nov 25 '18 at 22:42
Yeah sure... I just personally wouldn't use Whoosh for this if you've already got a perfectly good DB backend that can do it (and not in Python). You have forced a write to the index, I take it, and checked that you are indeed requesting the entire 7m to be written to it?
– Jon Clements♦
Nov 25 '18 at 22:46
That is correct, I'm taking the entire Postgres table and inserting it into Whoosh, either at once or in batches.
– adamczi
Nov 25 '18 at 22:48
Umm okies... I've not got any ideas I'm afraid... never had what you're describing myself... good luck in solving it.
– Jon Clements♦
Nov 25 '18 at 22:52