Hive View Query Performance: Union tables with different schemas











up vote
2
down vote

favorite












I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).



Table_A
{
business_date String
Name String
Age Number
} partitioned by business_date

Table_B {
business_date String
Name String
Age Number
Address String
} partitioned by business_date


In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:



Create VIEW customer_info AS 
select * from Table_B
UNION
select business_date, name, age, null as address from Table_A


I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?



Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?



Edit#2: Storing this data in one table is not possible due to tons of other problems.










share|improve this question




























    up vote
    2
    down vote

    favorite












    I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).



    Table_A
    {
    business_date String
    Name String
    Age Number
    } partitioned by business_date

    Table_B {
    business_date String
    Name String
    Age Number
    Address String
    } partitioned by business_date


    In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:



    Create VIEW customer_info AS 
    select * from Table_B
    UNION
    select business_date, name, age, null as address from Table_A


    I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?



    Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?



    Edit#2: Storing this data in one table is not possible due to tons of other problems.










    share|improve this question


























      up vote
      2
      down vote

      favorite









      up vote
      2
      down vote

      favorite











      I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).



      Table_A
      {
      business_date String
      Name String
      Age Number
      } partitioned by business_date

      Table_B {
      business_date String
      Name String
      Age Number
      Address String
      } partitioned by business_date


      In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:



      Create VIEW customer_info AS 
      select * from Table_B
      UNION
      select business_date, name, age, null as address from Table_A


      I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?



      Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?



      Edit#2: Storing this data in one table is not possible due to tons of other problems.










      share|improve this question















      I have a scenario where I have two Hive tables, and the second one is essentially an evolved schema of the first (it has 1 more column in this example).



      Table_A
      {
      business_date String
      Name String
      Age Number
      } partitioned by business_date

      Table_B {
      business_date String
      Name String
      Age Number
      Address String
      } partitioned by business_date


      In order to obfuscate downstream users from schema changes, I am creating a Hive view with the following syntax:



      Create VIEW customer_info AS 
      select * from Table_B
      UNION
      select business_date, name, age, null as address from Table_A


      I know the above returns all the data, but from a performance standpoint, if a query run against the view with a valid business_date value, does it take the partition key into account? Or do I lose this benefit when working with views?



      Edit: I should mention that business_date is actually a unique value across all partitions. This means, that data provided in Table_A, should not be provided in Table_B. Think of Table_A as being an "older version" of data. Given this, is this the best approach of serving the data if the goal is to abstract schema changes away from the end consumers?



      Edit#2: Storing this data in one table is not possible due to tons of other problems.







      hadoop hive hiveql hive-query






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 20 at 2:31

























      asked Nov 20 at 2:22









      NicolasCage

      113




      113
























          1 Answer
          1






          active

          oldest

          votes

















          up vote
          0
          down vote













          You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.



          If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION is the same as UNION ALL+DISTINCT.
          Use UNION ALL instead, it will perform much better.






          share|improve this answer





















          • UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
            – NicolasCage
            Nov 21 at 5:10












          • @NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
            – leftjoin
            Nov 21 at 7:20











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53385323%2fhive-view-query-performance-union-tables-with-different-schemas%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          0
          down vote













          You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.



          If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION is the same as UNION ALL+DISTINCT.
          Use UNION ALL instead, it will perform much better.






          share|improve this answer





















          • UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
            – NicolasCage
            Nov 21 at 5:10












          • @NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
            – leftjoin
            Nov 21 at 7:20















          up vote
          0
          down vote













          You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.



          If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION is the same as UNION ALL+DISTINCT.
          Use UNION ALL instead, it will perform much better.






          share|improve this answer





















          • UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
            – NicolasCage
            Nov 21 at 5:10












          • @NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
            – leftjoin
            Nov 21 at 7:20













          up vote
          0
          down vote










          up vote
          0
          down vote









          You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.



          If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION is the same as UNION ALL+DISTINCT.
          Use UNION ALL instead, it will perform much better.






          share|improve this answer












          You are not using any partition predicates in your query, that is why it will be no partition pruning. Use explain command to check this, it will show partition predicates applied. Partition pruning should work fine with a view.



          If business_date is unique value across all partitions then using UNION makes no sense here because all rows are unique. UNION is the same as UNION ALL+DISTINCT.
          Use UNION ALL instead, it will perform much better.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 20 at 8:25









          leftjoin

          7,75421950




          7,75421950












          • UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
            – NicolasCage
            Nov 21 at 5:10












          • @NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
            – leftjoin
            Nov 21 at 7:20


















          • UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
            – NicolasCage
            Nov 21 at 5:10












          • @NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
            – leftjoin
            Nov 21 at 7:20
















          UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
          – NicolasCage
          Nov 21 at 5:10






          UNION ALL makes so much more sense, completely forgot about that, thanks!.In terms of partition predicates, is there a way to get it applied in this specific scenario considering that both tables are partitioned by business_date and that value is unique across both tables.
          – NicolasCage
          Nov 21 at 5:10














          @NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
          – leftjoin
          Nov 21 at 7:20




          @NicolasCage If you are not filtering by business_date, partitions will not help in this case. Try to increase parallelism to achieve better performance: stackoverflow.com/a/48487306/2700344
          – leftjoin
          Nov 21 at 7:20


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.





          Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


          Please pay close attention to the following guidance:


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53385323%2fhive-view-query-performance-union-tables-with-different-schemas%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          404 Error Contact Form 7 ajax form submitting

          How to know if a Active Directory user can login interactively

          TypeError: fit_transform() missing 1 required positional argument: 'X'