Hive View Query Performance: Union tables with different schemas
up vote
2
down vote
favorite
I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).
Table_A
{
business_date String
Name String
Age Number
} partitioned by business_date
Table_B {
business_date String
Name String
Age Number
Address String
} partitioned by business_date
In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:
Create VIEW customer_info AS
select * from Table_B
UNION
select business_date, name, age, null as address from Table_A
I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?
Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?
Edit#2: Storing this data in one table is not possible due to tons of other problems.
hadoop hive hiveql hive-query
add a comment |
up vote
2
down vote
favorite
I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).
Table_A
{
business_date String
Name String
Age Number
} partitioned by business_date
Table_B {
business_date String
Name String
Age Number
Address String
} partitioned by business_date
In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:
Create VIEW customer_info AS
select * from Table_B
UNION
select business_date, name, age, null as address from Table_A
I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?
Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?
Edit#2: Storing this data in one table is not possible due to tons of other problems.
hadoop hive hiveql hive-query
add a comment |
up vote
2
down vote
favorite
up vote
2
down vote
favorite
I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).
Table_A
{
business_date String
Name String
Age Number
} partitioned by business_date
Table_B {
business_date String
Name String
Age Number
Address String
} partitioned by business_date
In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:
Create VIEW customer_info AS
select * from Table_B
UNION
select business_date, name, age, null as address from Table_A
I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?
Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?
Edit#2: Storing this data in one table is not possible due to tons of other problems.
hadoop hive hiveql hive-query
I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).
Table_A
{
business_date String
Name String
Age Number
} partitioned by business_date
Table_B {
business_date String
Name String
Age Number
Address String
} partitioned by business_date
In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:
Create VIEW customer_info AS
select * from Table_B
UNION
select business_date, name, age, null as address from Table_A
I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?
Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?
Edit#2: Storing this data in one table is not possible due to tons of other problems.
hadoop hive hiveql hive-query
hadoop hive hiveql hive-query
edited Nov 20 at 2:31
asked Nov 20 at 2:22
NicolasCage
113
113
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
up vote
0
down vote
You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.
If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION
is the same as UNION ALL
+DISTINCT
.
Use UNION ALL
instead, it will perform much better.
UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
– NicolasCage
Nov 21 at 5:10
@NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
– leftjoin
Nov 21 at 7:20
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
0
down vote
You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.
If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION
is the same as UNION ALL
+DISTINCT
.
Use UNION ALL
instead, it will perform much better.
UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
– NicolasCage
Nov 21 at 5:10
@NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
– leftjoin
Nov 21 at 7:20
add a comment |
up vote
0
down vote
You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.
If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION
is the same as UNION ALL
+DISTINCT
.
Use UNION ALL
instead, it will perform much better.
UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
– NicolasCage
Nov 21 at 5:10
@NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
– leftjoin
Nov 21 at 7:20
add a comment |
up vote
0
down vote
up vote
0
down vote
You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.
If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION
is the same as UNION ALL
+DISTINCT
.
Use UNION ALL
instead, it will perform much better.
You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.
If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION
is the same as UNION ALL
+DISTINCT
.
Use UNION ALL
instead, it will perform much better.
answered Nov 20 at 8:25
leftjoin
7,75421950
7,75421950
UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
– NicolasCage
Nov 21 at 5:10
@NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
– leftjoin
Nov 21 at 7:20
add a comment |
UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
– NicolasCage
Nov 21 at 5:10
@NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
– leftjoin
Nov 21 at 7:20
UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
– NicolasCage
Nov 21 at 5:10
UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
– NicolasCage
Nov 21 at 5:10
@NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
– leftjoin
Nov 21 at 7:20
@NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
– leftjoin
Nov 21 at 7:20
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53385323%2fhive-view-query-performance-union-tables-with-different-schemas%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown