How to use the new Hadoop Parquet magic committer to write to a custom S3 server with Spark























I have Spark 2.4.0 and Hadoop 3.1.1. According to the Hadoop documentation, to use the new magic committer, which allows consistent writes of Parquet files to S3, I've set these values in conf/spark-defaults.conf:


spark.sql.sources.commitProtocolClass                      com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class                   org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a  org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name                         magic
spark.hadoop.fs.s3a.committer.magic.enabled                true


With this configuration I end up with the exception:

java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My question is twofold: first, do I understand correctly that Hadoop 3.1.1 allows writing Parquet files to S3 consistently?

Second, if so, how do I use the new committer properly from Spark?
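For reference, the same properties can also be set programmatically when building the SparkSession rather than in spark-defaults.conf. A minimal sketch (the application name is a placeholder, and the commit-protocol class must still be present on the driver and executor classpath for this to work):

import org.apache.spark.sql.SparkSession

// Same committer settings as in spark-defaults.conf, applied at session build time.
// This does not by itself fix a ClassNotFoundException: the class named in
// spark.sql.sources.commitProtocolClass still has to be on the classpath.
val spark = SparkSession.builder()
  .appName("s3a-magic-committer-test")   // placeholder name
  .config("spark.sql.sources.commitProtocolClass",
    "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
    "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .getOrCreate()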










apache-spark hadoop amazon-s3

asked Nov 20 at 8:32 by Kiwy, edited Nov 20 at 9:33
2 Answers
Kiwy: that's my code, so I can help you with this. Some of the classes haven't made it into the ASF Spark releases, but you'll find them in the Hadoop JARs, and I could have a go at building the ASF release with the relevant dependencies in (I could put them in downstream; they used to be there).

You do not need S3Guard turned on to use the "staging committer"; it's only the "magic" variant which needs consistent object store listings during the commit phase.
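Not part of the answer itself, but as an illustration of the staging-committer route it mentions, here is a minimal sketch of the Hadoop-side S3A options involved (the "directory" committer name and the conflict mode are assumptions, not values given in the answer; see also the comments below):

// Hadoop-side S3A committer options only; set these before running a job that writes to s3a://.
// The Spark-side commit-protocol binding from the question is still needed, and it must come
// from a JAR that actually contains that class.
sc.hadoopConfiguration.set("mapreduce.outputcommitter.factory.scheme.s3a",
  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
sc.hadoopConfiguration.set("fs.s3a.committer.name", "directory")              // or "partitioned"
sc.hadoopConfiguration.set("fs.s3a.committer.staging.conflict-mode", "fail")  // fail | append | replace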






• I shall try this today.
  – Kiwy
  Nov 21 at 4:21

• I won't be able to use the magic committer. I'm trying fs.s3a.committer.name=partitioned and fs.s3a.committer.staging.conflict-mode=fail, and so far it's OK. Replace mode would consistently throw a 403 error; I suspect my instance of Swift is not to be relied on. I should try with a recent Minio server to confirm how consistent the error is. You've been of great help, Steve, thank you a lot.
  – Kiwy
  Nov 21 at 8:20

• Not tested with either. If it doesn't work, file a bug on the Apache JIRA, component "fs/s3", and link to "S3A features for Hadoop 3.3".
  – Steve Loughran
  Nov 21 at 12:48

• So far my attempt to convert a 500 GB CSV on S3 to Parquet on S3 in 3393 pieces failed with this message: filecsv.write.parquet("s3a://bucket/file.parquet") 2018-11-21 12:47:36 ERROR FileFormatWriter:91 - Aborting job 594be6. org.apache.hadoop.fs.s3a.AWSBadRequestException: delete on s3a://bucket/file.parquet/_temporary: com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema. (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: txa745d7; S3 Extended Request ID: null)
  – Kiwy
  Nov 21 at 13:02

• I can confirm that older Swift releases might not support the new partitioned committer. Thank you for your amazing work.
  – Kiwy
  Nov 30 at 8:12
Accepted answer:
Edit:

OK, I have two server instances, one being a bit old now. I've attempted to use the latest version of Minio with these parameters:


// options set directly on hadoopConfiguration use the plain "fs.s3a." prefix
// (no "hadoop." or "spark.hadoop." prefix is needed here)
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")


So far I'm able to write without trouble.

However, my Swift server, which is a bit older and uses this config:


sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")


does not seem to support the partitioned committer properly.

Regarding "Hadoop S3Guard":

This is not currently an option. Hadoop S3Guard, which keeps metadata about the S3 files, must be enabled in Hadoop, and S3Guard relies on DynamoDB, a proprietary Amazon service.

There is currently no alternative, such as an SQLite file or another database system, for storing the metadata.

So if you're using S3 with Minio or any other S3 implementation, you're missing DynamoDB.

This article explains nicely how S3Guard works.
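As a usage sketch to go with the Minio configuration above (bucket names and paths are placeholders), a plain DataFrame write is then enough, assuming the committer factory and commit-protocol binding discussed earlier are also in place:

// Read a CSV from the S3-compatible store and write it back as Parquet.
// Bucket and paths are placeholders; `spark` is the active SparkSession.
val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/input/data.csv")

df.write
  .mode("overwrite")
  .parquet("s3a://my-bucket/output/data.parquet")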





