Delete 500 million documents from a mongo collection containing 18 billion documents
We are trying to clean up a MongoDB collection that has 18 billion documents, and we need to remove around 500 million of them. Despite using an indexed query to delete the data in batches of 1000 and using bulk operations, I find the execution painfully slow. Could someone please suggest a strategy for this kind of cleanup? I am happy to provide more information if needed.
We are using a Scala process comprising 8 threads, each handling a batch of 1000, to do the cleanup.
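For illustration, the batching pattern looks roughly like the sketch below, using the MongoDB sync Java driver from Scala. The connection URI, the `mydb`/`events` names, the cutoff date, and the indexed `expiresAt` field are placeholders rather than our actual schema:

```scala
import java.util.concurrent.{Executors, TimeUnit}
import com.mongodb.client.MongoClients
import com.mongodb.client.model.{BulkWriteOptions, DeleteOneModel, Filters, Projections}
import org.bson.Document
import scala.jdk.CollectionConverters._

object BatchedCleanup extends App {
  val client     = MongoClients.create("mongodb://localhost:27017")   // placeholder URI
  val collection = client.getDatabase("mydb").getCollection("events") // placeholder names
  val cutoff     = java.util.Date.from(java.time.Instant.parse("2018-01-01T00:00:00Z"))

  // Fetch the _ids of up to 1000 documents via the indexed predicate,
  // then remove them with a single unordered bulk write.
  def deleteOneBatch(): Long = {
    val ids = collection
      .find(Filters.lt("expiresAt", cutoff))    // assumes an index on expiresAt
      .projection(Projections.include("_id"))   // pull only the _ids over the wire
      .limit(1000)
      .asScala
      .map(_.get("_id"))
      .toVector
    if (ids.isEmpty) 0L
    else {
      val deletes = ids.map(id => new DeleteOneModel[Document](Filters.eq("_id", id)))
      collection
        .bulkWrite(deletes.asJava, new BulkWriteOptions().ordered(false))
        .getDeletedCount
        .toLong
    }
  }

  // 8 workers, each looping over 1000-document batches until nothing matches.
  val pool = Executors.newFixedThreadPool(8)
  (1 to 8).foreach { _ =>
    pool.submit(new Runnable {
      def run(): Unit = while (deleteOneBatch() > 0) ()
    })
  }
  pool.shutdown()
  pool.awaitTermination(7, TimeUnit.DAYS)
  client.close()
}
```

The unordered bulk write avoids a round trip per document and lets the server continue past individual failures; since the workers may fetch overlapping batches, deleting an already-removed `_id` simply counts as zero.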
mongodb
asked Nov 24 '18 at 8:48
Anish Gupta
Small batches and indexes are what I was going to suggest. Is your hardware simply not up to the task? Do you know what is taking all the time? Is it page faults?
– Sergio Tulentsev
Nov 24 '18 at 8:54
I have checked the swap space, and it is unused. Checking the free memory on the host shows about 100 GB of cached memory that can be reclaimed when required. It doesn't look like a memory issue; it seems to have something to do with the way MongoDB executes the delete operations. The host I am currently running this on has a spinning disk, but I haven't seen a considerable speedup on SSDs either.
– Anish Gupta
Nov 24 '18 at 9:06
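One way to test the page-fault theory directly is to read the counter out of `serverStatus`. A minimal sketch, assuming the same sync Java driver as above (field names follow the `serverStatus` output; page-fault semantics vary by platform):

```scala
import com.mongodb.client.MongoClients
import org.bson.Document

object PageFaultCheck extends App {
  val client = MongoClients.create("mongodb://localhost:27017") // placeholder URI
  val status = client.getDatabase("admin").runCommand(new Document("serverStatus", 1))

  // extra_info.page_faults is the server's cumulative page-fault count
  val extraInfo = status.get("extra_info", classOf[Document])
  println(s"page faults: ${extraInfo.get("page_faults")}")

  client.close()
}
```

Sampling the counter twice, a few seconds apart, gives a fault rate; a counter that climbs steadily while the delete job runs points at the disk rather than the client.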
0 Answers