Best Handle Intraday GA Data in BigQuery












0














I have a configured google analytics raw data export to big query.



Could anyone from the community suggest efficient ways to query the intraday data as we noticed the problem for Intraday Sync (e.g. 15 minutes delay), the streaming data is growing exponentially across the sync frequency.



For example:




  • Everyday (T-1) batch data (ga_sessions_yyymmdd) syncs with 15-20GB with 3.5M-5M records.

  • On the other side, the intraday data streams (with 15 min delay) more than ~150GB per day with ~30M records.


enter image description here
https://issuetracker.google.com/issues/117064598



It's not cost-effective for persisting & querying the data.



And, is this a product bug or expected behavior as the data is not cost-effectively usable for exponentially growing data?



Querying big query cost $5 per TB & streaming inserts cost ~$50 per TB










share|improve this question



























    0














    I have a configured google analytics raw data export to big query.



    Could anyone from the community suggest efficient ways to query the intraday data as we noticed the problem for Intraday Sync (e.g. 15 minutes delay), the streaming data is growing exponentially across the sync frequency.



    For example:




    • Everyday (T-1) batch data (ga_sessions_yyymmdd) syncs with 15-20GB with 3.5M-5M records.

    • On the other side, the intraday data streams (with 15 min delay) more than ~150GB per day with ~30M records.


    enter image description here
    https://issuetracker.google.com/issues/117064598



    It's not cost-effective for persisting & querying the data.



    And, is this a product bug or expected behavior as the data is not cost-effectively usable for exponentially growing data?



    Querying big query cost $5 per TB & streaming inserts cost ~$50 per TB










    share|improve this question

























      0












      0








      0







      I have a configured google analytics raw data export to big query.



      Could anyone from the community suggest efficient ways to query the intraday data as we noticed the problem for Intraday Sync (e.g. 15 minutes delay), the streaming data is growing exponentially across the sync frequency.



      For example:




      • Everyday (T-1) batch data (ga_sessions_yyymmdd) syncs with 15-20GB with 3.5M-5M records.

      • On the other side, the intraday data streams (with 15 min delay) more than ~150GB per day with ~30M records.


      enter image description here
      https://issuetracker.google.com/issues/117064598



      It's not cost-effective for persisting & querying the data.



      And, is this a product bug or expected behavior as the data is not cost-effectively usable for exponentially growing data?



      Querying big query cost $5 per TB & streaming inserts cost ~$50 per TB










      share|improve this question













      I have a configured google analytics raw data export to big query.



      Could anyone from the community suggest efficient ways to query the intraday data as we noticed the problem for Intraday Sync (e.g. 15 minutes delay), the streaming data is growing exponentially across the sync frequency.



      For example:




      • Everyday (T-1) batch data (ga_sessions_yyymmdd) syncs with 15-20GB with 3.5M-5M records.

      • On the other side, the intraday data streams (with 15 min delay) more than ~150GB per day with ~30M records.


      enter image description here
      https://issuetracker.google.com/issues/117064598



      It's not cost-effective for persisting & querying the data.



      And, is this a product bug or expected behavior as the data is not cost-effectively usable for exponentially growing data?



      Querying big query cost $5 per TB & streaming inserts cost ~$50 per TB







      google-analytics google-cloud-platform google-bigquery






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 21 at 5:16









      Logan

      60711028




      60711028
























          1 Answer
          1






          active

          oldest

          votes


















          0














          In my vision, it is not a bug, it is a consequence of how data is structured in Google Analytics.



          Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, everytime a new hit (or group of hits) occurs the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).



          I have already created some stream pipelines in Google Dataflow with Session Windows (not sure if it is what Google uses internally), and I faced the same dilemma: wait to export the aggregate only once, or export continuously and have the exponential growth.



          An advice that I can give you about querying the ga_realtime_sessions table is:




          • Only query for the columns you really need (no select *);

          • use the view that is exported in conjunction with the daily ga_realtime_sessions_yyyymmdd, it doesn't affect the size of the query, but it will prevent you from using duplicated data.






          share|improve this answer





















            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53405649%2fbest-handle-intraday-ga-data-in-bigquery%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            0














            In my vision, it is not a bug, it is a consequence of how data is structured in Google Analytics.



            Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, everytime a new hit (or group of hits) occurs the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).



            I have already created some stream pipelines in Google Dataflow with Session Windows (not sure if it is what Google uses internally), and I faced the same dilemma: wait to export the aggregate only once, or export continuously and have the exponential growth.



            An advice that I can give you about querying the ga_realtime_sessions table is:




            • Only query for the columns you really need (no select *);

            • use the view that is exported in conjunction with the daily ga_realtime_sessions_yyyymmdd, it doesn't affect the size of the query, but it will prevent you from using duplicated data.






            share|improve this answer


























              0














              In my vision, it is not a bug, it is a consequence of how data is structured in Google Analytics.



              Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, everytime a new hit (or group of hits) occurs the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).



              I have already created some stream pipelines in Google Dataflow with Session Windows (not sure if it is what Google uses internally), and I faced the same dilemma: wait to export the aggregate only once, or export continuously and have the exponential growth.



              An advice that I can give you about querying the ga_realtime_sessions table is:




              • Only query for the columns you really need (no select *);

              • use the view that is exported in conjunction with the daily ga_realtime_sessions_yyyymmdd, it doesn't affect the size of the query, but it will prevent you from using duplicated data.






              share|improve this answer
























                0












                0








                0






                In my vision, it is not a bug, it is a consequence of how data is structured in Google Analytics.



                Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, everytime a new hit (or group of hits) occurs the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).



                I have already created some stream pipelines in Google Dataflow with Session Windows (not sure if it is what Google uses internally), and I faced the same dilemma: wait to export the aggregate only once, or export continuously and have the exponential growth.



                An advice that I can give you about querying the ga_realtime_sessions table is:




                • Only query for the columns you really need (no select *);

                • use the view that is exported in conjunction with the daily ga_realtime_sessions_yyyymmdd, it doesn't affect the size of the query, but it will prevent you from using duplicated data.






                share|improve this answer












                In my vision, it is not a bug, it is a consequence of how data is structured in Google Analytics.



                Each row is a session, and inside each session you have a number of hits. As we can't afford to wait until a session is completely finished, everytime a new hit (or group of hits) occurs the whole session needs to be exported again to BQ. Updating the row is not an option in a streaming system (at least in BigQuery).



                I have already created some stream pipelines in Google Dataflow with Session Windows (not sure if it is what Google uses internally), and I faced the same dilemma: wait to export the aggregate only once, or export continuously and have the exponential growth.



                An advice that I can give you about querying the ga_realtime_sessions table is:




                • Only query for the columns you really need (no select *);

                • use the view that is exported in conjunction with the daily ga_realtime_sessions_yyyymmdd, it doesn't affect the size of the query, but it will prevent you from using duplicated data.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 27 at 11:47









                Fernando Nunes

                236




                236






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.





                    Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                    Please pay close attention to the following guidance:


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53405649%2fbest-handle-intraday-ga-data-in-bigquery%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    404 Error Contact Form 7 ajax form submitting

                    How to know if a Active Directory user can login interactively

                    Refactoring coordinates for Minecraft Pi buildings written in Python