How to use the new Hadoop Parquet magic committer to write to a custom S3 server with Spark























I have Spark 2.4.0 and Hadoop 3.1.1. According to the Hadoop documentation, to use the new magic committer, which allows consistent writes of Parquet files to S3, I've set these values in conf/spark-defaults.conf:


spark.sql.sources.commitProtocolClass                      com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class                   org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a  org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name                         magic
spark.hadoop.fs.s3a.committer.magic.enabled                true


With this configuration I end up with the exception:

java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My question is twofold: first, do I understand correctly that Hadoop 3.1.1 allows writing Parquet files to S3 consistently?

Second, if so, how do I use the new committer properly from Spark?
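For reference, the same properties can also be set programmatically when building the SparkSession rather than in spark-defaults.conf. A minimal sketch (the application name is a placeholder, and the commit-protocol class must still be present on the driver and executor classpath for this to work):

import org.apache.spark.sql.SparkSession

// Same committer settings as in spark-defaults.conf, applied at session build time.
// This does not by itself fix a ClassNotFoundException: the class named in
// spark.sql.sources.commitProtocolClass still has to be on the classpath.
val spark = SparkSession.builder()
  .appName("s3a-magic-committer-test")   // placeholder name
  .config("spark.sql.sources.commitProtocolClass",
    "com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
    "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
  .getOrCreate()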










apache-spark hadoop amazon-s3

asked Nov 20 at 8:32 by Kiwy, edited Nov 20 at 9:33
2 Answers
Kiwy: that's my code, so I can help you with this. Some of the classes haven't made it into the ASF Spark releases, but you'll find them in the Hadoop JARs, and I could have a go at building the ASF release with the relevant dependencies in (I could put them in downstream; they used to be there).

You do not need S3Guard turned on to use the "staging committer"; it's only the "magic" variant which needs consistent object store listings during the commit phase.
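Not part of the answer itself, but as an illustration of the staging-committer route it mentions, here is a minimal sketch of the Hadoop-side S3A options involved (the "directory" committer name and the conflict mode are assumptions, not values given in the answer; see also the comments below):

// Hadoop-side S3A committer options only; set these before running a job that writes to s3a://.
// The Spark-side commit-protocol binding from the question is still needed, and it must come
// from a JAR that actually contains that class.
sc.hadoopConfiguration.set("mapreduce.outputcommitter.factory.scheme.s3a",
  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
sc.hadoopConfiguration.set("fs.s3a.committer.name", "directory")              // or "partitioned"
sc.hadoopConfiguration.set("fs.s3a.committer.staging.conflict-mode", "fail")  // fail | append | replace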






• I shall try this today.
  – Kiwy
  Nov 21 at 4:21

• I won't be able to use the magic committer. I'm trying fs.s3a.committer.name=partitioned and fs.s3a.committer.staging.conflict-mode=fail, and so far it's OK. Replace mode would consistently throw a 403 error; I suspect my instance of Swift is not to be relied on. I should try with a recent Minio server to confirm how consistent the error is. You've been of great help, Steve, thank you a lot.
  – Kiwy
  Nov 21 at 8:20

• Not tested with either. If it doesn't work, file a bug on the Apache JIRA, component "fs/s3", and link to "S3A features for Hadoop 3.3".
  – Steve Loughran
  Nov 21 at 12:48

• So far my attempt to convert a 500 GB CSV on S3 to Parquet on S3 in 3393 pieces failed with this message: filecsv.write.parquet("s3a://bucket/file.parquet") 2018-11-21 12:47:36 ERROR FileFormatWriter:91 - Aborting job 594be6. org.apache.hadoop.fs.s3a.AWSBadRequestException: delete on s3a://bucket/file.parquet/_temporary: com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema. (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: txa745d7; S3 Extended Request ID: null)
  – Kiwy
  Nov 21 at 13:02

• I can confirm that older Swift releases might not support the new partitioned committer. Thank you for your amazing work.
  – Kiwy
  Nov 30 at 8:12
Accepted answer:
Edit:

OK, I have two server instances, one being a bit old now. I've attempted to use the latest version of Minio with these parameters:


// options set directly on hadoopConfiguration use the plain "fs.s3a." prefix
// (no "hadoop." or "spark.hadoop." prefix is needed here)
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")


So far I'm able to write without trouble.

However, my Swift server, which is a bit older and uses this config:


sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")


does not seem to support the partitioned committer properly.

Regarding "Hadoop S3Guard":

This is not currently an option. Hadoop S3Guard, which keeps metadata about the S3 files, must be enabled in Hadoop, and S3Guard relies on DynamoDB, a proprietary Amazon service.

There is currently no alternative, such as an SQLite file or another database system, for storing the metadata.

So if you're using S3 with Minio or any other S3 implementation, you're missing DynamoDB.

This article explains nicely how S3Guard works.
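As a usage sketch to go with the Minio configuration above (bucket names and paths are placeholders), a plain DataFrame write is then enough, assuming the committer factory and commit-protocol binding discussed earlier are also in place:

// Read a CSV from the S3-compatible store and write it back as Parquet.
// Bucket and paths are placeholders; `spark` is the active SparkSession.
val df = spark.read
  .option("header", "true")
  .csv("s3a://my-bucket/input/data.csv")

df.write
  .mode("overwrite")
  .parquet("s3a://my-bucket/output/data.parquet")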





