AWS Glue JSON limit
I'm trying to use AWS Glue to automatically crawl and catalogue JSON files in an S3 bucket, as described here:
https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
Files smaller than 1 MB are catalogued successfully, but files larger than 1 MB fail to be catalogued and are classified as Unknown.
I have tried the approach listed here:
AWS Glue Crawler Classifies json file as UNKNOWN
However, it makes no difference.
Has anyone had similar issues?
amazon-web-services aws-glue
asked Nov 20 at 12:10
timothyclifford
1 Answer
I have the same problem. Have you tried flattening the data into ORC or a similar format? There seems to be a limitation on nested JSON above a certain size, even with custom classifiers. Alternatively, you can change your JSON from a single array
    [
      { ... },
      { ... }
    ]
into one object per line
    { ... }
    { ... }
which should work in Glue.
This is the Python script I ran to do that transformation (it worked with a 200 MB JSON file):
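    import json

    # Load the whole file; in my file the array sits under a top-level 'locations' key
    with open('./Data/data.json') as f:
        data = json.load(f)

    # Write one JSON object per line
    with open('./Data/data_flat.json', 'w') as file:
        for entry in data['locations']:
            file.write(json.dumps(entry) + '\n')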
Now Glue classifies it correctly!
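If your file is a bare top-level array rather than an object wrapping one, the same idea can be sketched a bit more generally. This is only a sketch: the function names and the `records_key` parameter are made up for illustration, and it still loads the whole file into memory, so it assumes the file fits in RAM. The second function re-parses each output line so you can check the result before uploading it to S3.

```python
import json


def to_json_lines(src, dst, records_key=None):
    """Rewrite a JSON array file as one object per line; return the record count.

    records_key is a hypothetical parameter: pass the name of the wrapping
    key (e.g. 'locations') when the array is nested inside an object, or
    leave it as None when the file is a bare top-level array.
    """
    with open(src, encoding='utf-8') as f:
        data = json.load(f)  # loads the whole file into memory
    records = data[records_key] if records_key is not None else data
    if not isinstance(records, list):
        raise ValueError('expected a JSON array of records')
    with open(dst, 'w', encoding='utf-8') as out:
        for entry in records:
            # With default settings json.dumps escapes newlines inside
            # strings, so each record stays on a single line.
            out.write(json.dumps(entry) + '\n')
    return len(records)


def count_valid_lines(path):
    """Parse every non-empty line on its own; raises if a line is not valid JSON."""
    count = 0
    with open(path, encoding='utf-8') as f:
        for line in f:
            if line.strip():
                json.loads(line)
                count += 1
    return count
```

For the file layout in my script above, the call would be `to_json_lines('./Data/data.json', './Data/data_flat.json', records_key='locations')`, and `count_valid_lines('./Data/data_flat.json')` should then return the same number.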
edited Nov 26 at 11:36
answered Nov 26 at 9:36
Finn Ickler
Thanks, I will take a look and see if this helps!
– timothyclifford
Nov 27 at 18:15