Basic Idea of Dataproc: How does it operate?
I am trying to understand the operational aspects of Dataproc.
Let's say I have a bunch of CSV files in a Cloud Storage bucket, and I have a single Python script which reads through them, does some aggregations, and saves the data to BigQuery. That's how it works on a single machine.
If I create a Dataproc cluster and let that script be run simultaneously by the nodes of the cluster, how is this work going to be parallelized between the cluster nodes? Will each node try to read all the files and do the aggregations, or will each one automagically read its respective subset? I am just trying to grasp how it will operate. Thanks.
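For reference, this is roughly the single-machine flow I have in mind; a minimal sketch only, with made-up bucket, dataset, and column names (pandas can read gs:// paths when gcsfs is installed):

import pandas as pd
from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

# Read every CSV object under a (hypothetical) prefix into one DataFrame.
blobs = storage_client.list_blobs("my-bucket", prefix="data/")
frames = [pd.read_csv(f"gs://my-bucket/{blob.name}")
          for blob in blobs if blob.name.endswith(".csv")]
df = pd.concat(frames, ignore_index=True)

# Some aggregation, e.g. a sum per key.
agg = df.groupby("key_column", as_index=False)["value_column"].sum()

# Save the result to a (hypothetical) BigQuery table.
bq_client.load_table_from_dataframe(agg, "my_dataset.my_table").result()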
bigdata google-cloud-dataproc
asked Nov 24 '18 at 3:07
khan
2 Answers
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53454834%2fbasic-idea-of-dataproc-how-does-it-operate%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Basically, when using Hadoop MapReduce/Spark in Dataproc to process data in GCS, there are 2 layers of abstractions involved.
One is the file system layer, GCS connector implements the Hadoop file system API which allows users to read/write a file from/to GCS, it is similar to HDFS. File system layer allows random read from any offset, but it has no knowledge about the format of the file (e.g., CSV, Parquet, Avro, etc).
There is another layer - InputFormat, which sits on top of the file system layer and knows the file format. A specific InputFormat knows how to break a file into splits (e.g., break a CSV file into multiple splits with different offsets) and turn each split into records (e.g., turn each line of a CSV file into a record).
When you write a MapReduce/Spark job, you know the format of the file, so you choose a specific InputFormat class. The InputFormat implementation can return the splits (metadata) of the file, then MapReduce/Spark can distribute the splits (metadata) to different workers in the cluster to process in parallel.
add a comment |
Basically, when using Hadoop MapReduce/Spark in Dataproc to process data in GCS, there are 2 layers of abstractions involved.
One is the file system layer, GCS connector implements the Hadoop file system API which allows users to read/write a file from/to GCS, it is similar to HDFS. File system layer allows random read from any offset, but it has no knowledge about the format of the file (e.g., CSV, Parquet, Avro, etc).
There is another layer - InputFormat, which sits on top of the file system layer and knows the file format. A specific InputFormat knows how to break a file into splits (e.g., break a CSV file into multiple splits with different offsets) and turn each split into records (e.g., turn each line of a CSV file into a record).
When you write a MapReduce/Spark job, you know the format of the file, so you choose a specific InputFormat class. The InputFormat implementation can return the splits (metadata) of the file, then MapReduce/Spark can distribute the splits (metadata) to different workers in the cluster to process in parallel.
add a comment |
Basically, when using Hadoop MapReduce/Spark in Dataproc to process data in GCS, there are 2 layers of abstractions involved.
One is the file system layer, GCS connector implements the Hadoop file system API which allows users to read/write a file from/to GCS, it is similar to HDFS. File system layer allows random read from any offset, but it has no knowledge about the format of the file (e.g., CSV, Parquet, Avro, etc).
There is another layer - InputFormat, which sits on top of the file system layer and knows the file format. A specific InputFormat knows how to break a file into splits (e.g., break a CSV file into multiple splits with different offsets) and turn each split into records (e.g., turn each line of a CSV file into a record).
When you write a MapReduce/Spark job, you know the format of the file, so you choose a specific InputFormat class. The InputFormat implementation can return the splits (metadata) of the file, then MapReduce/Spark can distribute the splits (metadata) to different workers in the cluster to process in parallel.
Basically, when you use Hadoop MapReduce or Spark on Dataproc to process data in GCS, two layers of abstraction are involved.
The first is the file system layer. The GCS connector implements the Hadoop file system API, which lets you read and write files in GCS much like you would in HDFS. The file system layer allows random reads from any offset, but it has no knowledge of the file format (e.g., CSV, Parquet, Avro).
The second layer is the InputFormat, which sits on top of the file system layer and does know the file format. A specific InputFormat knows how to break a file into splits (e.g., break a CSV file into multiple splits at different offsets) and how to turn each split into records (e.g., turn each line of a CSV file into a record).
When you write a MapReduce or Spark job, you know the format of your files, so you choose the corresponding InputFormat class. The InputFormat implementation returns the splits (metadata) of the files, and MapReduce/Spark then distributes those splits to different workers in the cluster to process in parallel.
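To see how those splits surface in Spark, here is a small sketch (the gs:// path is hypothetical; it assumes a Dataproc cluster, where the GCS connector is installed by default). Each partition of the resulting DataFrame corresponds to one split, and Spark schedules partitions across the executors, so no single worker reads everything:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splits-demo").getOrCreate()

# The CSV data source lists the files through the GCS connector and
# plans one or more splits per file.
df = spark.read.option("header", "true").csv("gs://my-bucket/data/*.csv")

# Each partition corresponds to a split and becomes one task.
print(df.rdd.getNumPartitions())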
edited Nov 26 '18 at 23:57
answered Nov 26 '18 at 23:50
Dagang
Let's suppose that your CSV files all have the same structure, and that you rewrote your Python script using the Spark API, with a small map/reduce transformation whose result is written to BigQuery.
- Will each node try to read all the files and do the aggregations OR each one will automatically read their respective subset?
You don't need to take care of this yourself: Spark reads the data in parallel for you. You only have to specify the location of your files, for example:
distFile = sc.textFile("file.csv")
distFile = sc.textFile("hdfs://path/to/folder/fixed_file_name_*.csv")
NOTE: the location scheme matters; the files can be local (file://), in HDFS (hdfs://) or, on Dataproc with the GCS connector, in Cloud Storage (gs://).
The parallel read is managed by YARN depending on worker availability. Once this first stage finishes, the aggregation (transformation) can be carried out; this second stage is also managed by YARN. Suppose your files contain only a numeric column; then a transformation could be:
mapFile = distFile.map(lambda x: int(x) * 2)
The mapFile RDD will have the same number of rows as distFile, where each new value is the original number multiplied by two. As you can see, you only write the transformation, while YARN schedules the execution by distributing the workload among the available workers (sub-tasks performing the same operation on different numbers).
After that, you can write mapFile into a BigQuery table, for example with the BigQuery connector for Spark/Hadoop, or by staging the results in GCS and loading them with the 'bq load' command.
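Putting the pieces together, here is a hedged end-to-end sketch of what the PySpark job could look like on Dataproc; bucket, dataset, and column names are made up, and the write step assumes the spark-bigquery connector is available on the cluster:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-bq").getOrCreate()

# 1) Parallel read: the CSV files are split across the workers.
df = spark.read.option("header", "true").csv("gs://my-bucket/data/*.csv")

# 2) Aggregation: runs in parallel, one task per partition.
agg = df.groupBy("key_column").agg(F.sum("value_column").alias("total"))

# 3) Write to BigQuery through the spark-bigquery connector,
#    which stages data in a temporary GCS bucket.
(agg.write.format("bigquery")
    .option("table", "my_dataset.my_table")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save())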
- How is this thing going to be parallelized between the cluster nodes?
It is not an easy job, because many factors on the worker side have to be considered (disk space, memory, availability, and so on); that is why YARN is in charge of these critical scheduling decisions. In fact, YARN can apply different approaches to specific workloads when scheduling jobs, such as the CapacityScheduler or the FairScheduler. The YARN UI shows some extra information about this when you run a job.
answered Dec 27 '18 at 2:01
rsantiago