What is happening during the “down time” when Spark is reading in big data sets on S3?












I have a bunch of JSON data in AWS S3 - let's say 100k files, each around 5MB - and I'm using Spark 2.2's DataFrameReader to read and process them via:



sparkSession.read.json(...)



I've found that Spark will just sort of hang for 5 minutes or so before beginning the computation, and the delay can stretch to hours for larger data sets. When I say "hang" I mean that the console progress bar indicating which stage the cluster is working on and how far along it is never appears - as far as I can tell the job is somehow in between stages.




What is Spark doing during this period, and how can I help it go faster?




I had two ideas, but both of them appear to be wrong.



My first idea was that Spark is attempting to list all of the files that it will need to do the computation. I tested this by actually creating a list of files offline and feeding them to Spark directly rather than using glob syntax:



val fileList = loadFiles()
sparkSession.read.json(fileList:_*)



This actually caused the "hanging" period to last longer!
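For reference, loadFiles() here is just a helper that builds the list of object paths up front. A rough sketch of one way it might look, using Hadoop's FileSystem API with a placeholder bucket and prefix (not my real layout):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path, RemoteIterator}
import scala.collection.mutable.ArrayBuffer

// Placeholder bucket/prefix - stand-ins for the real data layout
val fs = FileSystem.get(new URI("s3a://my-bucket/"), sparkSession.sparkContext.hadoopConfiguration)
val files = ArrayBuffer[String]()
val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(new Path("s3a://my-bucket/json-data/"), true)  // recursive listing
while (it.hasNext) {
  files += it.next().getPath.toString
}
val fileList = files.toList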



My second idea was that Spark is using this time to create a schema for all of the data. But I ruled this out by manually specifying a schema:



val schema = createSchema()
sparkSession.read.schema(schema).json(...)



Here the "hanging" period was the same as before, though the computation overall was much quicker.
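And for what it's worth, createSchema() hand-builds the StructType passed above. A trimmed-down sketch of what such a schema might look like - the field names here are made up, the real schema is much larger:

import org.apache.spark.sql.types._

// Hypothetical field names standing in for the real ones
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("timestamp", LongType, nullable = true),
  StructField("payload", StringType, nullable = true)
))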



So I'm not really sure what's going on or how to diagnose it. Anyone else run into this?










  • I can't really tell you what spark is doing, but I can say that spark has never been happy with reading many small files. If you have any way of aggregating the files into fewer but much bigger files (of say at least 500MB) beforehand, you should see a vast speed up.
    – Glennie Helles Sindholt
    Nov 21 at 9:27










  • @GlennieHellesSindholt Thank you for that suggestion - now that you mention it, I can think of other large datasets that I have worked with which were partitioned more efficiently, and they didn't have this problem. I remain modestly hopeful for a workaround which doesn't require restructuring the data, but maybe that's the best strategy. In your experience is there an upper limit to the file size for this purpose? Maybe it depends on the cluster?
    – Paul Siegel
    Nov 21 at 14:15












  • Well, I work with S3, which has a 5GB upper limit (or at least it used to - not sure if that has actually been removed), so my files are always smaller than 5GB. However, in all of my write jobs, I do make sure that I store data in few but large files for this exact reason.
    – Glennie Helles Sindholt
    Nov 22 at 9:17










  • @GlennieHellesSindholt I took your / Steve's advice, and it paid off significantly - using fewer but bigger files (about 2.6gb) the "down time" was eliminated and the job ran faster than before. Thanks!
    – Paul Siegel
    Nov 27 at 18:30
















1 Answer

The cost of listing directory trees in S3 is very high; this "partitioning" phase is what you are experiencing.

Fixes:

  • fewer, larger files

  • shallower directory tree
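
For example, a one-off compaction job along these lines - the bucket names, prefixes and partition count below are placeholders; pick a count that yields files of a few hundred MB each - rewrites the data as fewer, larger objects under a flat prefix:

// Read the many small JSON files once, then write them back out as ~1000 larger files.
val small = sparkSession.read.json("s3a://my-bucket/raw-json/")

small
  .repartition(1000)   // full shuffle, but produces evenly sized output files
  .write
  .mode("overwrite")
  .json("s3a://my-bucket/compacted-json/")

Every job that runs afterwards then only has to list on the order of a thousand objects instead of ~100k.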






  • Thank you - this sounds plausible given the way my data is organized. However, I would have expected that feeding spark a full list of files (no globs) would save it the trouble of parsing the directory tree - in my experiments this actually made things worse. Is there some other way to cache the directory tree or something to speed up future computations? I'm doing many computations on the same data, so maybe the best strategy is to merge files together and simplify the directory structure as you suggest.
    – Paul Siegel
    Nov 21 at 14:07










  • Yes, merging is good, not just for listing but for scheduling: each worker is scheduled with all or part of a single file at a time, so the bigger the files, the more work each can do before it needs to talk to the driver to commit that work and request more.
    – Steve Loughran
    Nov 21 at 14:55










  • Plus, Spark probably took the file list and rescanned it anyway, since it wouldn't know which entries were directories and which were files: one HEAD request per file.
    – Steve Loughran
    Nov 21 at 14:56










  • I bit the bullet and restructured the data so that the directory tree is flat and there are 1000 files instead of close to 1 million. It paid off: the "down time" was completely eliminated, and the various computations ran significantly faster. So you were right on the money. (Now to try to convince the systems team to change the way we ingest data...)
    – Paul Siegel
    Nov 27 at 18:28










