Basic Idea of Dataproc: How does it operate?



























I am trying to understand the operational aspects of Dataproc.

Let's say I have a bunch of CSV files in a Cloud Storage bucket, and I have a single Python script which reads through them, does some aggregations, and saves the data to BigQuery. That's how it works on a single machine.

If I create a Dataproc cluster and let that script be run simultaneously by the nodes of the cluster, how is this going to be parallelized between the cluster nodes? Will each node try to read all the files and do the aggregations, or will each one automagically read its respective subset? I am just trying to grasp how it will operate. Thanks.










bigdata google-cloud-dataproc

asked Nov 24 '18 at 3:07 by khan


2 Answers
Basically, when using Hadoop MapReduce or Spark on Dataproc to process data in GCS, there are two layers of abstraction involved.

One is the file system layer: the GCS connector implements the Hadoop FileSystem API, which lets you read and write files from/to GCS much like HDFS. The file system layer allows random reads from any offset, but it knows nothing about the format of the file (e.g., CSV, Parquet, Avro).

The other layer is the InputFormat, which sits on top of the file system layer and does know the file format. A specific InputFormat knows how to break a file into splits (e.g., break a CSV file into multiple splits at different offsets) and how to turn each split into records (e.g., turn each line of a CSV file into a record).
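
As a rough illustration (not part of the original answer; the bucket path is a placeholder), those splits surface on the Spark side as partitions of the RDD or DataFrame you read:

    from pyspark.sql import SparkSession

    # Assumes a Dataproc cluster, where the GCS connector is installed by default.
    spark = SparkSession.builder.appName("split-demo").getOrCreate()
    sc = spark.sparkContext

    # Each input split produced by the InputFormat becomes one partition.
    rdd = sc.textFile("gs://your-bucket/input/*.csv")
    print(rdd.getNumPartitions())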



When you write a MapReduce or Spark job, you know the format of the file, so you choose a specific InputFormat class. The InputFormat implementation returns the splits (metadata) of the file, and MapReduce/Spark then distributes those splits to different workers in the cluster to be processed in parallel.
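
Tying this back to the question, here is a minimal PySpark sketch of the CSV-to-BigQuery pipeline. It is an assumption-laden example rather than code from this answer: it presumes the spark-bigquery connector is available on the cluster, and the bucket, dataset, table, and column names are placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("csv-aggregation").getOrCreate()

    # Spark (GCS connector + CSV input format) splits the input files and
    # spreads the splits across the cluster's workers automatically.
    df = spark.read.csv("gs://your-bucket/input/*.csv", header=True, inferSchema=True)

    # Placeholder aggregation.
    agg = df.groupBy("key_column").agg(F.sum("value_column").alias("total"))

    # Indirect write to BigQuery via the spark-bigquery connector, which
    # stages the data in a temporary GCS bucket first.
    (agg.write.format("bigquery")
        .option("table", "your_dataset.your_table")
        .option("temporaryGcsBucket", "your-temp-bucket")
        .mode("overwrite")
        .save())

Which subset of data each node reads is decided by the splits, not by your script: every worker runs the same code against different partitions.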






answered Nov 26 '18 at 23:50, edited Nov 26 '18 at 23:57, by Dagang


Let's suppose that your CSV files all have the same structure, and that you wrote your Python script using the Spark API, with a small MapReduce-style transformation whose result is written to BigQuery.




            1. Will each node try to read all the files and do the aggregations OR
              each one will automatically read their respective subset?


You don't need to take care of this yourself; the framework reads the data in parallel as fast as it can. You only have to specify the location of your files, for example:



    distFile = sc.textFile("file.csv")
    distFile = sc.textFile("/hdfs/path/to/folder/fixed_file_name_*.csv")


NOTE: the file location has further implications; it can be a local file (file://), an HDFS path (hdfs://), or, on Dataproc, a Cloud Storage path (gs://) through the preinstalled GCS connector.



The parallel read is managed by YARN depending on worker availability. Once this first stage is finished, the aggregation (transformation) can be carried out; this second stage is also managed by YARN. Suppose your file contains only a numerical column; then this is the transformation (shown here in PySpark to match your Python script):



    mapFile = distFile.map(lambda x: float(x) * 2)  # textFile yields strings, so convert before doubling


The mapFile RDD will contain the same number of rows as distFile, with each new value being double the original number. As you can see, you only write the transformation, while YARN schedules the execution by distributing the workload among the available workers (sub-tasks performing the same operation on different numbers).



After that, you will be able to write the contents of mapFile into a BigQuery table, for example by exporting the results to Cloud Storage and loading them with the 'bq load' command, or through the BigQuery connector available on Dataproc; a rough sketch follows.
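
As a hedged sketch (not part of the original answer; the bucket, dataset, and table names are placeholders), the export-then-load route could look like this:

    # Save the doubled values to Cloud Storage as text files.
    mapFile.saveAsTextFile("gs://your-bucket/output/doubled")

    # Then, from a shell, load them into BigQuery with something like:
    #   bq load --source_format=CSV your_dataset.your_table \
    #       "gs://your-bucket/output/doubled/part-*" value:FLOAT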




            1. How is this thing going to be parallelized between the cluster
              nodes?


It is not an easy job, because many factors have to be considered on the worker side (disk space, memory, availability, and so on); that's why YARN was built to make these critical scheduling decisions. In fact, YARN can apply different approaches for specific workloads when scheduling jobs, such as the CapacityScheduler or the FairScheduler, and the YARN UI shows some extra information when you run a job.






answered Dec 27 '18 at 2:01 by rsantiago