Hive View Query Performance: Union tables with different schemas

I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).

Table_A {
    business_date String
    Name String
    Age Number
} partitioned by business_date

Table_B {
    business_date String
    Name String
    Age Number
    Address String
} partitioned by business_date

To shield downstream users from schema changes, I am creating a Hive view with the following syntax:

CREATE VIEW customer_info AS
select * from Table_B
UNION
select business_date, name, age, null as address from Table_A

I know the above returns all the data, but from a performance standpoint: if a query is run against the view with a valid business_date value, does Hive take the partition key into account? Or do I lose this benefit when working with views?

Edit: I should mention that each business_date value is unique across all partitions. This means that data provided in Table_A should not also be provided in Table_B. Think of Table_A as an "older version" of the data. Given this, is this the best approach to serving the data if the goal is to abstract schema changes away from the end consumers?

Edit#2: Storing this data in one table is not possible due to tons of other problems.

edited Nov 20 at 2:31

asked Nov 20 at 2:22

NicolasCage

hadoop hive hiveql hive-query


1 Answer

Your query does not use any partition predicates, which is why no partition pruning takes place. Use the EXPLAIN command to check this: it shows which partition predicates are applied. Partition pruning should work fine with a view.
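
As a minimal sketch of that check, assuming the customer_info view from the question and a made-up date literal ('2018-11-19' is only a placeholder), the plan can be inspected as follows. EXPLAIN DEPENDENCY reports the tables and partitions a query would read, so if pruning takes effect only the matching business_date partition should be listed.

```sql
-- Hypothetical filter value; substitute a business_date that exists in your data.
EXPLAIN
SELECT name, age, address
FROM customer_info
WHERE business_date = '2018-11-19';

-- Lists the input tables and input partitions the query depends on.
EXPLAIN DEPENDENCY
SELECT name, age, address
FROM customer_info
WHERE business_date = '2018-11-19';
```

Running the same statements without the WHERE clause gives a baseline to compare against: in that case every partition of both tables would be expected in the output.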

If business_date is a unique value across all partitions, then using UNION makes no sense here, because all rows are already unique. UNION is equivalent to UNION ALL followed by DISTINCT, so it adds a deduplication step that is not needed.
Use UNION ALL instead; it will perform much better.
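
A sketch of the view rewritten with UNION ALL, using the table and column names from the question. The explicit column list and the CAST on the NULL are my additions rather than part of the original view: UNION ALL matches columns by position, and Hive lists partition columns last in SELECT *, so naming the columns in both branches avoids relying on that order.

```sql
-- Sketch only: same tables and columns as in the question.
CREATE VIEW customer_info AS
SELECT business_date, name, age, address
FROM Table_B
UNION ALL
-- Table_A has no address column, so pad it with a typed NULL.
SELECT business_date, name, age, CAST(NULL AS STRING) AS address
FROM Table_A;
```

If a view with this name already exists, it would have to be dropped first, for example with DROP VIEW IF EXISTS customer_info.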

answered Nov 20 at 8:25

leftjoin

UNION ALL makes so much more sense; I completely forgot about that, thanks! In terms of partition predicates, is there a way to get them applied in this specific scenario, considering that both tables are partitioned by business_date and that value is unique across both tables?
– NicolasCage
Nov 21 at 5:10

@NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
– leftjoin
Nov 21 at 7:20
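
The linked answer is not reproduced here. Purely as an illustration of what session-level parallelism settings look like in Hive, the sketch below enables concurrent execution of independent stages, such as the two branches of a union. The values are placeholders, and whether these settings help at all depends on the execution engine and the cluster.

```sql
-- Illustrative values only; tune for your cluster.
SET hive.exec.parallel=true;             -- allow independent stages to run concurrently
SET hive.exec.parallel.thread.number=8;  -- upper bound on stages run in parallel
```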
