What is happening during the “down time” when Spark is reading in big data sets on S3?
I have a bunch of JSON data in AWS S3 - let's say 100k files, each around 5MB - and I'm using Spark 2.2's DataFrameReader
to read and process them via:
sparkSession.read.json(...)
I've found that Spark will just sort of hang for 5 minutes or so before beginning the computation, and this can take hours for larger data sets. When I say "hang" I mean that the console progress bar indicating which stage the cluster is working on and how far along it is never appears - as far as I can tell Spark is somehow in between stages.
What is Spark doing during this period, and how can I help it go faster?
I had two ideas, but both of them appear to be wrong.
My first idea was that Spark is attempting to list all of the files that it will need to do the computation. I tested this by actually creating a list of files offline and feeding them to Spark directly rather than using glob syntax:
val fileList = loadFiles()
sparkSession.read.json(fileList: _*)
This actually caused the "hanging" period to last longer!
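(For reference, by "feeding them directly" I mean something along these lines - the manifest path is a placeholder, and the list itself was generated ahead of time:)
// Hypothetical sketch of the helper: read a pre-built manifest of S3 object
// paths (one "s3a://bucket/key" per line) instead of globbing at job time.
def loadFiles(): Seq[String] = {
  val source = scala.io.Source.fromFile("/path/to/manifest.txt")
  try source.getLines().toList finally source.close()
}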
My second idea was that Spark is using this time to create a schema for all of the data. But I ruled this out by manually specifying a schema:
val schema = createSchema()
sparkSession.read.schema(schema).json(...)
Here the "hanging" period was the same as before, though the computation overall was much quicker.
So I'm not really sure what's going on or how to diagnose it. Anyone else run into this?
apache-spark
I can't really tell you what Spark is doing, but I can say that Spark has never been happy with reading many small files. If you have any way of aggregating the files into fewer but much bigger files (of, say, at least 500MB) beforehand, you should see a vast speed-up.
– Glennie Helles Sindholt
Nov 21 at 9:27
@GlennieHellesSindholt Thank you for that suggestion - now that you mention it, I can think of other large datasets I have worked with which were partitioned more efficiently, and they didn't have this problem. I remain modestly hopeful for a workaround that doesn't require restructuring the data, but maybe that's the best strategy. In your experience is there an upper limit to the file size for this purpose? Maybe it depends on the cluster?
– Paul Siegel
Nov 21 at 14:15
Well, I work with S3, which has a 5GB upper limit (or at least it used to - not sure whether that has actually been removed), so my files are always smaller than 5GB. However, in all of my write jobs I make sure to store data in few but large files for this exact reason.
– Glennie Helles Sindholt
Nov 22 at 9:17
@GlennieHellesSindholt I took your / Steve's advice, and it paid off significantly - using fewer but bigger files (about 2.6GB each) the "down time" was eliminated and the job ran faster than before. Thanks!
– Paul Siegel
Nov 27 at 18:30
edited Nov 20 at 23:07 by thebluephantom
asked Nov 20 at 20:10 by Paul Siegel
1 Answer
The cost of listing directory trees in S3 is very high; that "partitioning" phase is what you are experiencing.
Fixes:
- fewer, larger files
- a shallower directory tree
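A minimal compaction sketch along these lines would produce the fewer, larger files (the bucket names, paths, and target file count are placeholders, and it assumes the s3a connector is already configured):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-json").getOrCreate()

spark.read
  .json("s3a://my-bucket/raw-json/*/*.json")   // the original many small files
  .coalesce(1000)                              // far fewer, larger output files
  .write
  .mode("overwrite")
  .json("s3a://my-bucket/compacted-json/")     // flat output prefix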
Thank you - this sounds plausible given the way my data is organized. However, I would have expected that feeding Spark a full list of files (no globs) would save it the trouble of parsing the directory tree - in my experiments this actually made things worse. Is there some other way to cache the directory tree or otherwise speed up future computations? I'm doing many computations on the same data, so maybe the best strategy is to merge files together and simplify the directory structure as you suggest.
– Paul Siegel
Nov 21 at 14:07
Yes, merging is good, not just for listing but for scheduling: each worker is scheduled with all or part of a single file at a time, so the bigger the files, the more work each can do before it needs to talk to the driver to commit that work and request more.
– Steve Loughran
Nov 21 at 14:55
Plus, Spark probably took the file list and rescanned it anyway, as it wouldn't know which entries were directories and which were files: one HEAD request per file.
– Steve Loughran
Nov 21 at 14:56
I bit the bullet and restructured the data so that the directory tree is flat and there are 1000 files instead of close to 1 million. It paid off: the "down time" was completely eliminated, and the various computations ran significantly faster. So you were right on the money. (Now to try to convince the systems team to change the way we ingest data...)
– Paul Siegel
Nov 27 at 18:28
answered Nov 21 at 12:50 by Steve Loughran