Use Window to count lines with if condition in scala











up vote
1
down vote

favorite
1












I hope that you might help me :-)



I have a dataframe with posted advert .
I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.



I created the dataframe below to explain things better:



var df = sc.parallelize(Array(
(1, "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"),
(2, "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"),
(3, "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"),
(4, "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"),
(5, "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"),
(6, "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"),
(7, "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"),
(8, "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"),
(9, "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"),
(10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",
"email", "date_minus1","date_minus2m")

df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))



  • date_minus1 = the day before the advert was posted

  • date_minus2m = 2 month before the advert was posted


I want to count the number of advert, with the same email, between those 2 dates...



What I want as a result is:



+---------+----------------+
|id_advert|nb_prev_advert |
+---------+----------------+
|6 |2 |
|3 |3 |
|5 |3 |
|9 |2 |
|4 |1 |
|8 |3 |
|7 |3 |
|10 |3 |
+--------+-----------------+


I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...



I am sur the we can do something like:



val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)


And use it to go across the dataframe and count only row with




  • email of the row = email of the current_row

  • date_minus2m of the row< date creation of the current row < date_minus1 of the row










share|improve this question




























    up vote
    1
    down vote

    favorite
    1












    I hope that you might help me :-)



    I have a dataframe with posted advert .
    I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.



    I created the dataframe below to explain things better:



    var df = sc.parallelize(Array(
    (1, "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"),
    (2, "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"),
    (3, "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"),
    (4, "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"),
    (5, "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"),
    (6, "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"),
    (7, "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"),
    (8, "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"),
    (9, "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"),
    (10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",
    "email", "date_minus1","date_minus2m")

    df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
    df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp")))
    df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))



    • date_minus1 = the day before the advert was posted

    • date_minus2m = 2 month before the advert was posted


    I want to count the number of advert, with the same email, between those 2 dates...



    What I want as a result is:



    +---------+----------------+
    |id_advert|nb_prev_advert |
    +---------+----------------+
    |6 |2 |
    |3 |3 |
    |5 |3 |
    |9 |2 |
    |4 |1 |
    |8 |3 |
    |7 |3 |
    |10 |3 |
    +--------+-----------------+


    I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...



    I am sur the we can do something like:



    val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)


    And use it to go across the dataframe and count only row with




    • email of the row = email of the current_row

    • date_minus2m of the row< date creation of the current row < date_minus1 of the row










    share|improve this question


























      up vote
      1
      down vote

      favorite
      1









      up vote
      1
      down vote

      favorite
      1






      1





      I hope that you might help me :-)



      I have a dataframe with posted advert .
      I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.



      I created the dataframe below to explain things better:



      var df = sc.parallelize(Array(
      (1, "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"),
      (2, "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"),
      (3, "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"),
      (4, "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"),
      (5, "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"),
      (6, "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"),
      (7, "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"),
      (8, "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"),
      (9, "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"),
      (10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",
      "email", "date_minus1","date_minus2m")

      df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
      df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp")))
      df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))



      • date_minus1 = the day before the advert was posted

      • date_minus2m = 2 month before the advert was posted


      I want to count the number of advert, with the same email, between those 2 dates...



      What I want as a result is:



      +---------+----------------+
      |id_advert|nb_prev_advert |
      +---------+----------------+
      |6 |2 |
      |3 |3 |
      |5 |3 |
      |9 |2 |
      |4 |1 |
      |8 |3 |
      |7 |3 |
      |10 |3 |
      +--------+-----------------+


      I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...



      I am sur the we can do something like:



      val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)


      And use it to go across the dataframe and count only row with




      • email of the row = email of the current_row

      • date_minus2m of the row< date creation of the current row < date_minus1 of the row










      share|improve this question















      I hope that you might help me :-)



      I have a dataframe with posted advert .
      I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.



      I created the dataframe below to explain things better:



      var df = sc.parallelize(Array(
      (1, "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"),
      (2, "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"),
      (3, "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"),
      (4, "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"),
      (5, "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"),
      (6, "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"),
      (7, "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"),
      (8, "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"),
      (9, "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"),
      (10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",
      "email", "date_minus1","date_minus2m")

      df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
      df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp")))
      df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))



      • date_minus1 = the day before the advert was posted

      • date_minus2m = 2 month before the advert was posted


      I want to count the number of advert, with the same email, between those 2 dates...



      What I want as a result is:



      +---------+----------------+
      |id_advert|nb_prev_advert |
      +---------+----------------+
      |6 |2 |
      |3 |3 |
      |5 |3 |
      |9 |2 |
      |4 |1 |
      |8 |3 |
      |7 |3 |
      |10 |3 |
      +--------+-----------------+


      I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...



      I am sur the we can do something like:



      val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)


      And use it to go across the dataframe and count only row with




      • email of the row = email of the current_row

      • date_minus2m of the row< date creation of the current row < date_minus1 of the row







      scala apache-spark






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 16 at 8:35

























      asked Nov 16 at 7:58









      Anneso

      769




      769
























          3 Answers
          3






          active

          oldest

          votes

















          up vote
          1
          down vote



          accepted










          Adding this as different answer because it is different



          Input:



          df.select("*").orderBy("email","creation_date").show()

          +---------+--------------------+------------+----+
          |id_advert| creation_date| email|sold|
          +---------+--------------------+------------+----+
          | 1|2015-06-29 10:53:...|boulanger.fr| 1|
          | 5|2015-07-28 11:22:...|boulanger.fr| 0|
          | 3|2017-06-28 10:31:...|boulanger.fr| 1|
          | 7|2017-08-24 16:08:...|boulanger.fr| 1|
          | 9|2017-09-04 14:35:...|boulanger.fr| 1|
          | 10|2012-09-07 15:10:...|patissier.fr| 0|
          | 8|2014-08-31 17:20:...|patissier.fr| 1|
          | 2|2016-07-05 10:48:...|patissier.fr| 1|
          | 4|2017-08-21 17:31:...|patissier.fr| 0|
          | 6|2017-08-23 17:03:...|patissier.fr| 0|
          +---------+--------------------+------------+----+


          Now you define your window spec as something like this



          val w = Window.
          partitionBy("email").
          orderBy(col("creation_date").
          cast("timestamp").
          cast("long")).rangeBetween(-60*24*60*60,-1)


          And the main query will be:



          df.
          select(
          col("*"),count("email").over(w).alias("all_prev_mail_advert"),
          sum("sold").over(w).alias("all_prev_sold_mail_advert")
          ).orderBy("email","creation_date").show()


          Output:



          +---------+--------------------+------------+----+--------------------+-------------------------+
          |id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
          +---------+--------------------+------------+----+--------------------+-------------------------+
          | 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
          | 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
          | 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
          | 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
          | 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
          | 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
          | 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
          | 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
          | 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
          | 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
          +---------+--------------------+------------+----+--------------------+-------------------------+


          Explanation:



          We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.



          And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.






          share|improve this answer




























            up vote
            1
            down vote













            Here is the answer with using Window with a range



            Create a window spec with range between current and past sixty days



            val w = Window
            .partitionBy(col("email"))
            .orderBy(col("creation_date").cast("timestamp").cast("long"))
            .rangeBetween(-60*86400,-1)


            Then select it over your data frame



            df
            .select(col("*"),count("email").over(w).alias("trailing_count"))
            .orderBy("email","creation_date") //using this for display purpose
            .show()


            Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.



            Input Data :



            df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()

            +---------+--------------------+------------+
            |id_advert| creation_date| email|
            +---------+--------------------+------------+
            | 3|2017-06-28 10:31:...|boulanger.fr|
            | 1|2017-06-29 10:53:...|boulanger.fr|
            | 5|2017-07-28 11:22:...|boulanger.fr|
            | 7|2017-08-24 16:08:...|boulanger.fr|
            | 9|2017-09-04 14:35:...|boulanger.fr|
            | 2|2017-07-05 10:48:...|patissier.fr|
            | 4|2017-08-21 17:31:...|patissier.fr|
            | 6|2017-08-23 17:03:...|patissier.fr|
            | 8|2017-08-31 17:20:...|patissier.fr|
            | 10|2017-09-07 15:10:...|patissier.fr|
            +---------+--------------------+------------+


            Output:



            +---------+--------------------+------------+-------------+--------------+
            |id_advert| creation_date| email|date_creation|trailing_count|
            +---------+--------------------+------------+-------------+--------------+
            | 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
            | 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
            | 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
            | 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
            | 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
            | 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
            | 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
            | 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
            | 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
            | 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
            +---------+--------------------+------------+-------------+--------------+





            share|improve this answer





















            • Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
              – Biswanath
              Nov 19 at 15:12










            • Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
              – Anneso
              Nov 19 at 15:26










            • It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
              – Anneso
              Nov 19 at 15:54


















            up vote
            0
            down vote













            As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.



            I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...



            How does it work? To me:




            • if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

            • if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...


            What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.



            Is it an easy way to use what you did and apply it to my real issue?



            My dataframe look like this:



            var df = sc.parallelize(Array(
            (1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
            (2, "2016-07-05 10:48:57.0","patissier.fr", 1),
            (3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
            (4, "2017-08-21 17:31:12.0","patissier.fr", 0),
            (5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
            (6, "2017-08-23 17:03:43.0","patissier.fr", 0),
            (7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
            (8, "2014-08-31 17:20:43.0","patissier.fr", 1),
            (9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
            (10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")


            For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...



            Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).






            share|improve this answer





















            • So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
              – Biswanath
              Nov 19 at 16:20










            • Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
              – Anneso
              Nov 19 at 16:29











            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














             

            draft saved


            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53333635%2fuse-window-to-count-lines-with-if-condition-in-scala%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            3 Answers
            3






            active

            oldest

            votes








            3 Answers
            3






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            1
            down vote



            accepted










            Adding this as different answer because it is different



            Input:



            df.select("*").orderBy("email","creation_date").show()

            +---------+--------------------+------------+----+
            |id_advert| creation_date| email|sold|
            +---------+--------------------+------------+----+
            | 1|2015-06-29 10:53:...|boulanger.fr| 1|
            | 5|2015-07-28 11:22:...|boulanger.fr| 0|
            | 3|2017-06-28 10:31:...|boulanger.fr| 1|
            | 7|2017-08-24 16:08:...|boulanger.fr| 1|
            | 9|2017-09-04 14:35:...|boulanger.fr| 1|
            | 10|2012-09-07 15:10:...|patissier.fr| 0|
            | 8|2014-08-31 17:20:...|patissier.fr| 1|
            | 2|2016-07-05 10:48:...|patissier.fr| 1|
            | 4|2017-08-21 17:31:...|patissier.fr| 0|
            | 6|2017-08-23 17:03:...|patissier.fr| 0|
            +---------+--------------------+------------+----+


            Now you define your window spec as something like this



            val w = Window.
            partitionBy("email").
            orderBy(col("creation_date").
            cast("timestamp").
            cast("long")).rangeBetween(-60*24*60*60,-1)


            And the main query will be:



            df.
            select(
            col("*"),count("email").over(w).alias("all_prev_mail_advert"),
            sum("sold").over(w).alias("all_prev_sold_mail_advert")
            ).orderBy("email","creation_date").show()


            Output:



            +---------+--------------------+------------+----+--------------------+-------------------------+
            |id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
            +---------+--------------------+------------+----+--------------------+-------------------------+
            | 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
            | 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
            | 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
            | 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
            | 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
            | 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
            | 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
            | 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
            | 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
            | 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
            +---------+--------------------+------------+----+--------------------+-------------------------+


            Explanation:



            We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.



            And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.






            share|improve this answer

























              up vote
              1
              down vote



              accepted










              Adding this as different answer because it is different



              Input:



              df.select("*").orderBy("email","creation_date").show()

              +---------+--------------------+------------+----+
              |id_advert| creation_date| email|sold|
              +---------+--------------------+------------+----+
              | 1|2015-06-29 10:53:...|boulanger.fr| 1|
              | 5|2015-07-28 11:22:...|boulanger.fr| 0|
              | 3|2017-06-28 10:31:...|boulanger.fr| 1|
              | 7|2017-08-24 16:08:...|boulanger.fr| 1|
              | 9|2017-09-04 14:35:...|boulanger.fr| 1|
              | 10|2012-09-07 15:10:...|patissier.fr| 0|
              | 8|2014-08-31 17:20:...|patissier.fr| 1|
              | 2|2016-07-05 10:48:...|patissier.fr| 1|
              | 4|2017-08-21 17:31:...|patissier.fr| 0|
              | 6|2017-08-23 17:03:...|patissier.fr| 0|
              +---------+--------------------+------------+----+


              Now you define your window spec as something like this



              val w = Window.
              partitionBy("email").
              orderBy(col("creation_date").
              cast("timestamp").
              cast("long")).rangeBetween(-60*24*60*60,-1)


              And the main query will be:



              df.
              select(
              col("*"),count("email").over(w).alias("all_prev_mail_advert"),
              sum("sold").over(w).alias("all_prev_sold_mail_advert")
              ).orderBy("email","creation_date").show()


              Output:



              +---------+--------------------+------------+----+--------------------+-------------------------+
              |id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
              +---------+--------------------+------------+----+--------------------+-------------------------+
              | 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
              | 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
              | 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
              | 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
              | 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
              | 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
              | 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
              | 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
              | 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
              | 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
              +---------+--------------------+------------+----+--------------------+-------------------------+


              Explanation:



              We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.



              And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.






              share|improve this answer























                up vote
                1
                down vote



                accepted







                up vote
                1
                down vote



                accepted






                Adding this as different answer because it is different



                Input:



                df.select("*").orderBy("email","creation_date").show()

                +---------+--------------------+------------+----+
                |id_advert| creation_date| email|sold|
                +---------+--------------------+------------+----+
                | 1|2015-06-29 10:53:...|boulanger.fr| 1|
                | 5|2015-07-28 11:22:...|boulanger.fr| 0|
                | 3|2017-06-28 10:31:...|boulanger.fr| 1|
                | 7|2017-08-24 16:08:...|boulanger.fr| 1|
                | 9|2017-09-04 14:35:...|boulanger.fr| 1|
                | 10|2012-09-07 15:10:...|patissier.fr| 0|
                | 8|2014-08-31 17:20:...|patissier.fr| 1|
                | 2|2016-07-05 10:48:...|patissier.fr| 1|
                | 4|2017-08-21 17:31:...|patissier.fr| 0|
                | 6|2017-08-23 17:03:...|patissier.fr| 0|
                +---------+--------------------+------------+----+


                Now you define your window spec as something like this



                val w = Window.
                partitionBy("email").
                orderBy(col("creation_date").
                cast("timestamp").
                cast("long")).rangeBetween(-60*24*60*60,-1)


                And the main query will be:



                df.
                select(
                col("*"),count("email").over(w).alias("all_prev_mail_advert"),
                sum("sold").over(w).alias("all_prev_sold_mail_advert")
                ).orderBy("email","creation_date").show()


                Output:



                +---------+--------------------+------------+----+--------------------+-------------------------+
                |id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
                +---------+--------------------+------------+----+--------------------+-------------------------+
                | 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
                | 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
                | 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
                | 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
                | 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
                | 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
                | 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
                | 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
                | 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
                | 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
                +---------+--------------------+------------+----+--------------------+-------------------------+


                Explanation:



                We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.



                And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.






                share|improve this answer












                Adding this as different answer because it is different



                Input:



                df.select("*").orderBy("email","creation_date").show()

                +---------+--------------------+------------+----+
                |id_advert| creation_date| email|sold|
                +---------+--------------------+------------+----+
                | 1|2015-06-29 10:53:...|boulanger.fr| 1|
                | 5|2015-07-28 11:22:...|boulanger.fr| 0|
                | 3|2017-06-28 10:31:...|boulanger.fr| 1|
                | 7|2017-08-24 16:08:...|boulanger.fr| 1|
                | 9|2017-09-04 14:35:...|boulanger.fr| 1|
                | 10|2012-09-07 15:10:...|patissier.fr| 0|
                | 8|2014-08-31 17:20:...|patissier.fr| 1|
                | 2|2016-07-05 10:48:...|patissier.fr| 1|
                | 4|2017-08-21 17:31:...|patissier.fr| 0|
                | 6|2017-08-23 17:03:...|patissier.fr| 0|
                +---------+--------------------+------------+----+


                Now you define your window spec as something like this



                val w = Window.
                partitionBy("email").
                orderBy(col("creation_date").
                cast("timestamp").
                cast("long")).rangeBetween(-60*24*60*60,-1)


                And the main query will be:



                df.
                select(
                col("*"),count("email").over(w).alias("all_prev_mail_advert"),
                sum("sold").over(w).alias("all_prev_sold_mail_advert")
                ).orderBy("email","creation_date").show()


                Output:



                +---------+--------------------+------------+----+--------------------+-------------------------+
                |id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
                +---------+--------------------+------------+----+--------------------+-------------------------+
                | 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
                | 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
                | 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
                | 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
                | 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
                | 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
                | 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
                | 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
                | 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
                | 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
                +---------+--------------------+------------+----+--------------------+-------------------------+


                Explanation:



                We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.



                And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 19 at 18:01









                Biswanath

                5,106103556




                5,106103556
























                    up vote
                    1
                    down vote













                    Here is the answer with using Window with a range



                    Create a window spec with range between current and past sixty days



                    val w = Window
                    .partitionBy(col("email"))
                    .orderBy(col("creation_date").cast("timestamp").cast("long"))
                    .rangeBetween(-60*86400,-1)


                    Then select it over your data frame



                    df
                    .select(col("*"),count("email").over(w).alias("trailing_count"))
                    .orderBy("email","creation_date") //using this for display purpose
                    .show()


                    Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.



                    Input Data :



                    df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()

                    +---------+--------------------+------------+
                    |id_advert| creation_date| email|
                    +---------+--------------------+------------+
                    | 3|2017-06-28 10:31:...|boulanger.fr|
                    | 1|2017-06-29 10:53:...|boulanger.fr|
                    | 5|2017-07-28 11:22:...|boulanger.fr|
                    | 7|2017-08-24 16:08:...|boulanger.fr|
                    | 9|2017-09-04 14:35:...|boulanger.fr|
                    | 2|2017-07-05 10:48:...|patissier.fr|
                    | 4|2017-08-21 17:31:...|patissier.fr|
                    | 6|2017-08-23 17:03:...|patissier.fr|
                    | 8|2017-08-31 17:20:...|patissier.fr|
                    | 10|2017-09-07 15:10:...|patissier.fr|
                    +---------+--------------------+------------+


                    Output:



                    +---------+--------------------+------------+-------------+--------------+
                    |id_advert| creation_date| email|date_creation|trailing_count|
                    +---------+--------------------+------------+-------------+--------------+
                    | 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
                    | 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
                    | 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
                    | 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
                    | 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
                    | 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
                    | 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
                    | 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
                    | 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
                    | 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
                    +---------+--------------------+------------+-------------+--------------+





                    share|improve this answer





















                    • Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
                      – Biswanath
                      Nov 19 at 15:12










                    • Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
                      – Anneso
                      Nov 19 at 15:26










                    • It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
                      – Anneso
                      Nov 19 at 15:54















                    up vote
                    1
                    down vote













                    Here is the answer with using Window with a range



                    Create a window spec with range between current and past sixty days



                    val w = Window
                    .partitionBy(col("email"))
                    .orderBy(col("creation_date").cast("timestamp").cast("long"))
                    .rangeBetween(-60*86400,-1)


                    Then select it over your data frame



                    df
                    .select(col("*"),count("email").over(w).alias("trailing_count"))
                    .orderBy("email","creation_date") //using this for display purpose
                    .show()


                    Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.



                    Input Data :



                    df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()

                    +---------+--------------------+------------+
                    |id_advert| creation_date| email|
                    +---------+--------------------+------------+
                    | 3|2017-06-28 10:31:...|boulanger.fr|
                    | 1|2017-06-29 10:53:...|boulanger.fr|
                    | 5|2017-07-28 11:22:...|boulanger.fr|
                    | 7|2017-08-24 16:08:...|boulanger.fr|
                    | 9|2017-09-04 14:35:...|boulanger.fr|
                    | 2|2017-07-05 10:48:...|patissier.fr|
                    | 4|2017-08-21 17:31:...|patissier.fr|
                    | 6|2017-08-23 17:03:...|patissier.fr|
                    | 8|2017-08-31 17:20:...|patissier.fr|
                    | 10|2017-09-07 15:10:...|patissier.fr|
                    +---------+--------------------+------------+


                    Output:



                    +---------+--------------------+------------+-------------+--------------+
                    |id_advert| creation_date| email|date_creation|trailing_count|
                    +---------+--------------------+------------+-------------+--------------+
                    | 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
                    | 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
                    | 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
                    | 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
                    | 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
                    | 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
                    | 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
                    | 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
                    | 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
                    | 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
                    +---------+--------------------+------------+-------------+--------------+





                    share|improve this answer





















                    • Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
                      – Biswanath
                      Nov 19 at 15:12










                    • Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
                      – Anneso
                      Nov 19 at 15:26










                    • It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
                      – Anneso
                      Nov 19 at 15:54













                    up vote
                    1
                    down vote










                    up vote
                    1
                    down vote









                    Here is the answer with using Window with a range



                    Create a window spec with range between current and past sixty days



                    val w = Window
                    .partitionBy(col("email"))
                    .orderBy(col("creation_date").cast("timestamp").cast("long"))
                    .rangeBetween(-60*86400,-1)


                    Then select it over your data frame



                    df
                    .select(col("*"),count("email").over(w).alias("trailing_count"))
                    .orderBy("email","creation_date") //using this for display purpose
                    .show()


                    Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.



                    Input Data :



                    df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()

                    +---------+--------------------+------------+
                    |id_advert| creation_date| email|
                    +---------+--------------------+------------+
                    | 3|2017-06-28 10:31:...|boulanger.fr|
                    | 1|2017-06-29 10:53:...|boulanger.fr|
                    | 5|2017-07-28 11:22:...|boulanger.fr|
                    | 7|2017-08-24 16:08:...|boulanger.fr|
                    | 9|2017-09-04 14:35:...|boulanger.fr|
                    | 2|2017-07-05 10:48:...|patissier.fr|
                    | 4|2017-08-21 17:31:...|patissier.fr|
                    | 6|2017-08-23 17:03:...|patissier.fr|
                    | 8|2017-08-31 17:20:...|patissier.fr|
                    | 10|2017-09-07 15:10:...|patissier.fr|
                    +---------+--------------------+------------+


                    Output:



                    +---------+--------------------+------------+-------------+--------------+
                    |id_advert| creation_date| email|date_creation|trailing_count|
                    +---------+--------------------+------------+-------------+--------------+
                    | 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
                    | 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
                    | 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
                    | 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
                    | 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
                    | 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
                    | 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
                    | 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
                    | 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
                    | 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
                    +---------+--------------------+------------+-------------+--------------+





                    share|improve this answer












                    Here is the answer with using Window with a range



                    Create a window spec with range between current and past sixty days



                    val w = Window
                    .partitionBy(col("email"))
                    .orderBy(col("creation_date").cast("timestamp").cast("long"))
                    .rangeBetween(-60*86400,-1)


                    Then select it over your data frame



                    df
                    .select(col("*"),count("email").over(w).alias("trailing_count"))
                    .orderBy("email","creation_date") //using this for display purpose
                    .show()


                    Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.



                    Input Data :



                    df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()

                    +---------+--------------------+------------+
                    |id_advert| creation_date| email|
                    +---------+--------------------+------------+
                    | 3|2017-06-28 10:31:...|boulanger.fr|
                    | 1|2017-06-29 10:53:...|boulanger.fr|
                    | 5|2017-07-28 11:22:...|boulanger.fr|
                    | 7|2017-08-24 16:08:...|boulanger.fr|
                    | 9|2017-09-04 14:35:...|boulanger.fr|
                    | 2|2017-07-05 10:48:...|patissier.fr|
                    | 4|2017-08-21 17:31:...|patissier.fr|
                    | 6|2017-08-23 17:03:...|patissier.fr|
                    | 8|2017-08-31 17:20:...|patissier.fr|
                    | 10|2017-09-07 15:10:...|patissier.fr|
                    +---------+--------------------+------------+


                    Output:



                    +---------+--------------------+------------+-------------+--------------+
                    |id_advert| creation_date| email|date_creation|trailing_count|
                    +---------+--------------------+------------+-------------+--------------+
                    | 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
                    | 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
                    | 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
                    | 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
                    | 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
                    | 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
                    | 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
                    | 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
                    | 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
                    | 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
                    +---------+--------------------+------------+-------------+--------------+






                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Nov 18 at 18:45









                    Biswanath

                    5,106103556




                    5,106103556












                    • Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
                      – Biswanath
                      Nov 19 at 15:12










                    • Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
                      – Anneso
                      Nov 19 at 15:26










                    • It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
                      – Anneso
                      Nov 19 at 15:54


















                    • Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
                      – Biswanath
                      Nov 19 at 15:12










                    • Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
                      – Anneso
                      Nov 19 at 15:26










                    • It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
                      – Anneso
                      Nov 19 at 15:54
















                    Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
                    – Biswanath
                    Nov 19 at 15:12




                    Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
                    – Biswanath
                    Nov 19 at 15:12












                    Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
                    – Anneso
                    Nov 19 at 15:26




                    Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
                    – Anneso
                    Nov 19 at 15:26












                    It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
                    – Anneso
                    Nov 19 at 15:54




                    It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
                    – Anneso
                    Nov 19 at 15:54










                    up vote
                    0
                    down vote













                    As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.



                    I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...



                    How does it work? To me:




                    • if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

                    • if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...


                    What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.



                    Is it an easy way to use what you did and apply it to my real issue?



                    My dataframe look like this:



                    var df = sc.parallelize(Array(
                    (1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
                    (2, "2016-07-05 10:48:57.0","patissier.fr", 1),
                    (3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
                    (4, "2017-08-21 17:31:12.0","patissier.fr", 0),
                    (5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
                    (6, "2017-08-23 17:03:43.0","patissier.fr", 0),
                    (7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
                    (8, "2014-08-31 17:20:43.0","patissier.fr", 1),
                    (9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
                    (10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")


                    For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...



                    Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).






                    share|improve this answer





















                    • So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
                      – Biswanath
                      Nov 19 at 16:20










                    • Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
                      – Anneso
                      Nov 19 at 16:29















                    up vote
                    0
                    down vote













                    As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.



                    I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...



                    How does it work? To me:




                    • if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

                    • if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...


                    What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.



                    Is it an easy way to use what you did and apply it to my real issue?



                    My dataframe look like this:



                    var df = sc.parallelize(Array(
                    (1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
                    (2, "2016-07-05 10:48:57.0","patissier.fr", 1),
                    (3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
                    (4, "2017-08-21 17:31:12.0","patissier.fr", 0),
                    (5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
                    (6, "2017-08-23 17:03:43.0","patissier.fr", 0),
                    (7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
                    (8, "2014-08-31 17:20:43.0","patissier.fr", 1),
                    (9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
                    (10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")


                    For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...



                    Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).






                    share|improve this answer





















                    • So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
                      – Biswanath
                      Nov 19 at 16:20










                    • Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
                      – Anneso
                      Nov 19 at 16:29













                    up vote
                    0
                    down vote










                    up vote
                    0
                    down vote









                    As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.



                    I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...



                    How does it work? To me:




                    • if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

                    • if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...


                    What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.



                    Is it an easy way to use what you did and apply it to my real issue?



                    My dataframe look like this:



                    var df = sc.parallelize(Array(
                    (1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
                    (2, "2016-07-05 10:48:57.0","patissier.fr", 1),
                    (3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
                    (4, "2017-08-21 17:31:12.0","patissier.fr", 0),
                    (5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
                    (6, "2017-08-23 17:03:43.0","patissier.fr", 0),
                    (7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
                    (8, "2014-08-31 17:20:43.0","patissier.fr", 1),
                    (9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
                    (10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")


                    For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...



                    Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).






                    share|improve this answer












                    As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.



                    I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...



                    How does it work? To me:




                    • if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

                    • if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...


                    What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.



                    Is it an easy way to use what you did and apply it to my real issue?



                    My dataframe look like this:



                    var df = sc.parallelize(Array(
                    (1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
                    (2, "2016-07-05 10:48:57.0","patissier.fr", 1),
                    (3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
                    (4, "2017-08-21 17:31:12.0","patissier.fr", 0),
                    (5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
                    (6, "2017-08-23 17:03:43.0","patissier.fr", 0),
                    (7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
                    (8, "2014-08-31 17:20:43.0","patissier.fr", 1),
                    (9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
                    (10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")


                    For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...



                    Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).







                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Nov 19 at 15:53









                    Anneso

                    769




                    769












                    • So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
                      – Biswanath
                      Nov 19 at 16:20










                    • Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
                      – Anneso
                      Nov 19 at 16:29


















                    • So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
                      – Biswanath
                      Nov 19 at 16:20










                    • Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
                      – Anneso
                      Nov 19 at 16:29
















                    So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
                    – Biswanath
                    Nov 19 at 16:20




                    So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
                    – Biswanath
                    Nov 19 at 16:20












                    Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
                    – Anneso
                    Nov 19 at 16:29




                    Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
                    – Anneso
                    Nov 19 at 16:29


















                     

                    draft saved


                    draft discarded



















































                     


                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53333635%2fuse-window-to-count-lines-with-if-condition-in-scala%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Feedback on college project

                    Futebolista

                    Albești (Vaslui)