Basic Idea of Dataproc: How does it operate?
I am trying to understand the operational aspects of Dataproc.
Let's say I have a bunch of CSV files in a Cloud Storage bucket, and I have a single Python script which reads through them, does some aggregations, and saves the data to BigQuery. That's how it works on a single machine.
If I create a Dataproc cluster and let that script be run simultaneously by the nodes of the cluster, how is this work going to be parallelized between the cluster nodes? Will each node try to read all the files and do the aggregations, or will each one automagically read its respective subset? I am just trying to grasp how it will operate. Thanks.
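For reference, this is roughly the single-machine flow I have in mind; a minimal sketch only, with made-up bucket, dataset, and column names (pandas can read gs:// paths when gcsfs is installed):

import pandas as pd
from google.cloud import bigquery, storage

storage_client = storage.Client()
bq_client = bigquery.Client()

# Read every CSV object under a (hypothetical) prefix into one DataFrame.
blobs = storage_client.list_blobs("my-bucket", prefix="data/")
frames = [pd.read_csv(f"gs://my-bucket/{blob.name}")
          for blob in blobs if blob.name.endswith(".csv")]
df = pd.concat(frames, ignore_index=True)

# Some aggregation, e.g. a sum per key.
agg = df.groupby("key_column", as_index=False)["value_column"].sum()

# Save the result to a (hypothetical) BigQuery table.
bq_client.load_table_from_dataframe(agg, "my_dataset.my_table").result()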
bigdata google-cloud-dataproc
asked Nov 24 '18 at 3:07
khan
2 Answers
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53454834%2fbasic-idea-of-dataproc-how-does-it-operate%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Basically, when using Hadoop MapReduce/Spark in Dataproc to process data in GCS, there are 2 layers of abstractions involved.
One is the file system layer, GCS connector implements the Hadoop file system API which allows users to read/write a file from/to GCS, it is similar to HDFS. File system layer allows random read from any offset, but it has no knowledge about the format of the file (e.g., CSV, Parquet, Avro, etc).
There is another layer - InputFormat, which sits on top of the file system layer and knows the file format. A specific InputFormat knows how to break a file into splits (e.g., break a CSV file into multiple splits with different offsets) and turn each split into records (e.g., turn each line of a CSV file into a record).
When you write a MapReduce/Spark job, you know the format of the file, so you choose a specific InputFormat class. The InputFormat implementation can return the splits (metadata) of the file, then MapReduce/Spark can distribute the splits (metadata) to different workers in the cluster to process in parallel.
add a comment |
Basically, when using Hadoop MapReduce/Spark in Dataproc to process data in GCS, there are 2 layers of abstractions involved.
One is the file system layer, GCS connector implements the Hadoop file system API which allows users to read/write a file from/to GCS, it is similar to HDFS. File system layer allows random read from any offset, but it has no knowledge about the format of the file (e.g., CSV, Parquet, Avro, etc).
There is another layer - InputFormat, which sits on top of the file system layer and knows the file format. A specific InputFormat knows how to break a file into splits (e.g., break a CSV file into multiple splits with different offsets) and turn each split into records (e.g., turn each line of a CSV file into a record).
When you write a MapReduce/Spark job, you know the format of the file, so you choose a specific InputFormat class. The InputFormat implementation can return the splits (metadata) of the file, then MapReduce/Spark can distribute the splits (metadata) to different workers in the cluster to process in parallel.
add a comment |
Basically, when using Hadoop MapReduce/Spark in Dataproc to process data in GCS, there are 2 layers of abstractions involved.
One is the file system layer, GCS connector implements the Hadoop file system API which allows users to read/write a file from/to GCS, it is similar to HDFS. File system layer allows random read from any offset, but it has no knowledge about the format of the file (e.g., CSV, Parquet, Avro, etc).
There is another layer - InputFormat, which sits on top of the file system layer and knows the file format. A specific InputFormat knows how to break a file into splits (e.g., break a CSV file into multiple splits with different offsets) and turn each split into records (e.g., turn each line of a CSV file into a record).
When you write a MapReduce/Spark job, you know the format of the file, so you choose a specific InputFormat class. The InputFormat implementation can return the splits (metadata) of the file, then MapReduce/Spark can distribute the splits (metadata) to different workers in the cluster to process in parallel.
Basically, when you use Hadoop MapReduce or Spark on Dataproc to process data in GCS, two layers of abstraction are involved.
The first is the file system layer. The GCS connector implements the Hadoop file system API, which lets you read and write files in GCS much like you would in HDFS. The file system layer allows random reads from any offset, but it has no knowledge of the file format (e.g., CSV, Parquet, Avro).
The second layer is the InputFormat, which sits on top of the file system layer and does know the file format. A specific InputFormat knows how to break a file into splits (e.g., break a CSV file into multiple splits at different offsets) and how to turn each split into records (e.g., turn each line of a CSV file into a record).
When you write a MapReduce or Spark job, you know the format of your files, so you choose the corresponding InputFormat class. The InputFormat implementation returns the splits (metadata) of the files, and MapReduce/Spark then distributes those splits to different workers in the cluster to process in parallel.
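To see how those splits surface in Spark, here is a small sketch (the gs:// path is hypothetical; it assumes a Dataproc cluster, where the GCS connector is installed by default). Each partition of the resulting DataFrame corresponds to one split, and Spark schedules partitions across the executors, so no single worker reads everything:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splits-demo").getOrCreate()

# The CSV data source lists the files through the GCS connector and
# plans one or more splits per file.
df = spark.read.option("header", "true").csv("gs://my-bucket/data/*.csv")

# Each partition corresponds to a split and becomes one task.
print(df.rdd.getNumPartitions())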
edited Nov 26 '18 at 23:57
answered Nov 26 '18 at 23:50
Dagang
Let's suppose that your CSV files all have the same structure, and that you rewrote your Python script using the Spark API, with a small map/reduce transformation whose result is written to BigQuery.
- Will each node try to read all the files and do the aggregations OR each one will automatically read their respective subset?
You don't need to take care of this yourself: Spark reads the data in parallel for you. You only have to specify the location of your files, for example:
distFile = sc.textFile("file.csv")
distFile = sc.textFile("hdfs://path/to/folder/fixed_file_name_*.csv")
NOTE: the location scheme matters; the files can be local (file://), in HDFS (hdfs://) or, on Dataproc with the GCS connector, in Cloud Storage (gs://).
The parallel read is managed by YARN depending on worker availability. Once this first stage finishes, the aggregation (transformation) can be carried out; this second stage is also managed by YARN. Suppose your files contain only a numeric column; then a transformation could be:
mapFile = distFile.map(lambda x: int(x) * 2)
The mapFile RDD will have the same number of rows as distFile, where each new value is the original number multiplied by two. As you can see, you only write the transformation, while YARN schedules the execution by distributing the workload among the available workers (sub-tasks performing the same operation on different numbers).
After that, you can write mapFile into a BigQuery table, for example with the BigQuery connector for Spark/Hadoop, or by staging the results in GCS and loading them with the 'bq load' command.
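Putting the pieces together, here is a hedged end-to-end sketch of what the PySpark job could look like on Dataproc; bucket, dataset, and column names are made up, and the write step assumes the spark-bigquery connector is available on the cluster:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("csv-to-bq").getOrCreate()

# 1) Parallel read: the CSV files are split across the workers.
df = spark.read.option("header", "true").csv("gs://my-bucket/data/*.csv")

# 2) Aggregation: runs in parallel, one task per partition.
agg = df.groupBy("key_column").agg(F.sum("value_column").alias("total"))

# 3) Write to BigQuery through the spark-bigquery connector,
#    which stages data in a temporary GCS bucket.
(agg.write.format("bigquery")
    .option("table", "my_dataset.my_table")
    .option("temporaryGcsBucket", "my-temp-bucket")
    .mode("overwrite")
    .save())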
- How is this thing going to be parallelized between the cluster nodes?
It is not an easy job, because many factors on the worker side have to be considered (disk space, memory, availability, and so on); that is why YARN is in charge of these critical scheduling decisions. In fact, YARN can apply different approaches to specific workloads when scheduling jobs, such as the CapacityScheduler or the FairScheduler. The YARN UI shows some extra information about this when you run a job.
answered Dec 27 '18 at 2:01
rsantiago