Use Window to count lines with if condition in scala

up vote
1
down vote

favorite

I hope that you might help me :-)

I have a dataframe with posted advert .
I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.

I created the dataframe below to explain things better:

var df = sc.parallelize(Array(

(1,  "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"), 

(2,  "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"), 

(3,  "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"), 

(4,  "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"), 

(5,  "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"), 

(6,  "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"), 

(7,  "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"), 

(8,  "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"), 

(9,  "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"), 

(10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",

    "email", "date_minus1","date_minus2m")



df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))  

df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp"))) 

df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))

date_minus1 = the day before the advert was posted

date_minus2m = 2 month before the advert was posted

I want to count the number of advert, with the same email, between those 2 dates...

What I want as a result is:

+---------+----------------+

|id_advert|nb_prev_advert  |

+---------+----------------+

|6        |2               |

|3        |3               |

|5        |3               |

|9        |2               |

|4        |1               |

|8        |3               |

|7        |3               |

|10       |3               |

+--------+-----------------+

I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...

I am sur the we can do something like:

val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)

And use it to go across the dataframe and count only row with

email of the row = email of the current_row

date_minus2m of the row< date creation of the current row < date_minus1 of the row

edited Nov 16 at 8:35

asked Nov 16 at 7:58

Anneso

769

add a comment |

up vote
1
down vote

favorite

I hope that you might help me :-)

I have a dataframe with posted advert .
I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.

I created the dataframe below to explain things better:

var df = sc.parallelize(Array(

(1,  "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"), 

(2,  "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"), 

(3,  "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"), 

(4,  "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"), 

(5,  "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"), 

(6,  "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"), 

(7,  "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"), 

(8,  "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"), 

(9,  "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"), 

(10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",

    "email", "date_minus1","date_minus2m")



df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))  

df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp"))) 

df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))

date_minus1 = the day before the advert was posted

date_minus2m = 2 month before the advert was posted

I want to count the number of advert, with the same email, between those 2 dates...

What I want as a result is:

+---------+----------------+

|id_advert|nb_prev_advert  |

+---------+----------------+

|6        |2               |

|3        |3               |

|5        |3               |

|9        |2               |

|4        |1               |

|8        |3               |

|7        |3               |

|10       |3               |

+--------+-----------------+

I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...

I am sur the we can do something like:

val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)

And use it to go across the dataframe and count only row with

email of the row = email of the current_row

date_minus2m of the row< date creation of the current row < date_minus1 of the row

edited Nov 16 at 8:35

asked Nov 16 at 7:58

Anneso

769

add a comment |

up vote
1
down vote

favorite

I hope that you might help me :-)

I have a dataframe with posted advert .
I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.

I created the dataframe below to explain things better:

var df = sc.parallelize(Array(

(1,  "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"), 

(2,  "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"), 

(3,  "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"), 

(4,  "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"), 

(5,  "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"), 

(6,  "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"), 

(7,  "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"), 

(8,  "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"), 

(9,  "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"), 

(10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",

    "email", "date_minus1","date_minus2m")



df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))  

df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp"))) 

df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))

date_minus1 = the day before the advert was posted

date_minus2m = 2 month before the advert was posted

I want to count the number of advert, with the same email, between those 2 dates...

What I want as a result is:

+---------+----------------+

|id_advert|nb_prev_advert  |

+---------+----------------+

|6        |2               |

|3        |3               |

|5        |3               |

|9        |2               |

|4        |1               |

|8        |3               |

|7        |3               |

|10       |3               |

+--------+-----------------+

I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...

I am sur the we can do something like:

val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)

And use it to go across the dataframe and count only row with

email of the row = email of the current_row

date_minus2m of the row< date creation of the current row < date_minus1 of the row

edited Nov 16 at 8:35

asked Nov 16 at 7:58

Anneso

769

I hope that you might help me :-)

I have a dataframe with posted advert .
I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.

I created the dataframe below to explain things better:

var df = sc.parallelize(Array(

(1,  "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"), 

(2,  "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"), 

(3,  "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"), 

(4,  "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"), 

(5,  "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"), 

(6,  "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"), 

(7,  "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"), 

(8,  "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"), 

(9,  "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"), 

(10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",

    "email", "date_minus1","date_minus2m")



df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))  

df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp"))) 

df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))

date_minus1 = the day before the advert was posted

date_minus2m = 2 month before the advert was posted

I want to count the number of advert, with the same email, between those 2 dates...

What I want as a result is:

+---------+----------------+

|id_advert|nb_prev_advert  |

+---------+----------------+

|6        |2               |

|3        |3               |

|5        |3               |

|9        |2               |

|4        |1               |

|8        |3               |

|7        |3               |

|10       |3               |

+--------+-----------------+

I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...

I am sur the we can do something like:

val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)

And use it to go across the dataframe and count only row with

email of the row = email of the current_row

date_minus2m of the row< date creation of the current row < date_minus1 of the row

scala apache-spark

edited Nov 16 at 8:35

asked Nov 16 at 7:58

Anneso

769

edited Nov 16 at 8:35

asked Nov 16 at 7:58

Anneso

769

edited Nov 16 at 8:35

asked Nov 16 at 7:58

Anneso

769

asked Nov 16 at 7:58

Anneso

769

asked Nov 16 at 7:58

Anneso

769

add a comment |

3 Answers
3

active

oldest

votes

up vote
1
down vote

accepted

Adding this as different answer because it is different

Input:

df.select("*").orderBy("email","creation_date").show()



+---------+--------------------+------------+----+

|id_advert|       creation_date|       email|sold|

+---------+--------------------+------------+----+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|

|        8|2014-08-31 17:20:...|patissier.fr|   1|

|        2|2016-07-05 10:48:...|patissier.fr|   1|

|        4|2017-08-21 17:31:...|patissier.fr|   0|

|        6|2017-08-23 17:03:...|patissier.fr|   0|

+---------+--------------------+------------+----+

Now you define your window spec as something like this

val w = Window.

          partitionBy("email").

          orderBy(col("creation_date"). 

          cast("timestamp").

          cast("long")).rangeBetween(-60*24*60*60,-1)

And the main query will be:

df.

  select(

      col("*"),count("email").over(w).alias("all_prev_mail_advert"), 

      sum("sold").over(w).alias("all_prev_sold_mail_advert")

  ).orderBy("email","creation_date").show()

Output:

+---------+--------------------+------------+----+--------------------+-------------------------+

|id_advert|       creation_date|       email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|

+---------+--------------------+------------+----+--------------------+-------------------------+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|                   0|                     null|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|                   1|                        1|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|                   0|                     null|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|                   1|                        1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|                   1|                        1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|                   0|                     null|

|        8|2014-08-31 17:20:...|patissier.fr|   1|                   0|                     null|

|        2|2016-07-05 10:48:...|patissier.fr|   1|                   0|                     null|

|        4|2017-08-21 17:31:...|patissier.fr|   0|                   0|                     null|

|        6|2017-08-23 17:03:...|patissier.fr|   0|                   1|                        0|

+---------+--------------------+------------+----+--------------------+-------------------------+

Explanation:

We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.

And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.

answered Nov 19 at 18:01

Biswanath

5,106103556

add a comment |

up vote
1
down vote

Here is the answer with using Window with a range

Create a window spec with range between current and past sixty days

val w = Window

          .partitionBy(col("email"))

          .orderBy(col("creation_date").cast("timestamp").cast("long"))

          .rangeBetween(-60*86400,-1)

Then select it over your data frame

df

 .select(col("*"),count("email").over(w).alias("trailing_count"))

 .orderBy("email","creation_date") //using this for display purpose 

 .show()

Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.

Input Data :

df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()



+---------+--------------------+------------+

|id_advert|       creation_date|       email|

+---------+--------------------+------------+

|        3|2017-06-28 10:31:...|boulanger.fr|

|        1|2017-06-29 10:53:...|boulanger.fr|

|        5|2017-07-28 11:22:...|boulanger.fr|

|        7|2017-08-24 16:08:...|boulanger.fr|

|        9|2017-09-04 14:35:...|boulanger.fr|

|        2|2017-07-05 10:48:...|patissier.fr|

|        4|2017-08-21 17:31:...|patissier.fr|

|        6|2017-08-23 17:03:...|patissier.fr|

|        8|2017-08-31 17:20:...|patissier.fr|

|       10|2017-09-07 15:10:...|patissier.fr|

+---------+--------------------+------------+

Output:

+---------+--------------------+------------+-------------+--------------+

|id_advert|       creation_date|       email|date_creation|trailing_count|

+---------+--------------------+------------+-------------+--------------+

|        3|2017-06-28 10:31:...|boulanger.fr|   1498645902|             0|

|        1|2017-06-29 10:53:...|boulanger.fr|   1498733633|             1|

|        5|2017-07-28 11:22:...|boulanger.fr|   1501240962|             2|

|        7|2017-08-24 16:08:...|boulanger.fr|   1503590887|             3|

|        9|2017-09-04 14:35:...|boulanger.fr|   1504535738|             2|

|        2|2017-07-05 10:48:...|patissier.fr|   1499251737|             0|

|        4|2017-08-21 17:31:...|patissier.fr|   1503336672|             1|

|        6|2017-08-23 17:03:...|patissier.fr|   1503507823|             2|

|        8|2017-08-31 17:20:...|patissier.fr|   1504200043|             3|

|       10|2017-09-07 15:10:...|patissier.fr|   1504797034|             3|

+---------+--------------------+------------+-------------+--------------+

answered Nov 18 at 18:45

Biswanath

5,106103556

Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12

Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26

It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54

add a comment |

up vote
0
down vote

As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.

I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...

How does it work? To me:

if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...

What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.

Is it an easy way to use what you did and apply it to my real issue?

My dataframe look like this:

var df = sc.parallelize(Array(

(1,  "2015-06-29 10:53:53.0","boulanger.fr", 1),

(2,  "2016-07-05 10:48:57.0","patissier.fr", 1),

(3,  "2017-06-28 10:31:42.0","boulanger.fr", 1),

(4,  "2017-08-21 17:31:12.0","patissier.fr", 0),

(5,  "2015-07-28 11:22:42.0","boulanger.fr", 0),

(6,  "2017-08-23 17:03:43.0","patissier.fr", 0),

(7,  "2017-08-24 16:08:07.0","boulanger.fr", 1),

(8,  "2014-08-31 17:20:43.0","patissier.fr", 1),

(9,  "2017-09-04 14:35:38.0","boulanger.fr", 1),

(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")

For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...

Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).

answered Nov 19 at 15:53

Anneso

769

So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20

Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53333635%2fuse-window-to-count-lines-with-if-condition-in-scala%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
1
down vote

accepted

Adding this as different answer because it is different

Input:

df.select("*").orderBy("email","creation_date").show()



+---------+--------------------+------------+----+

|id_advert|       creation_date|       email|sold|

+---------+--------------------+------------+----+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|

|        8|2014-08-31 17:20:...|patissier.fr|   1|

|        2|2016-07-05 10:48:...|patissier.fr|   1|

|        4|2017-08-21 17:31:...|patissier.fr|   0|

|        6|2017-08-23 17:03:...|patissier.fr|   0|

+---------+--------------------+------------+----+

Now you define your window spec as something like this

val w = Window.

          partitionBy("email").

          orderBy(col("creation_date"). 

          cast("timestamp").

          cast("long")).rangeBetween(-60*24*60*60,-1)

And the main query will be:

df.

  select(

      col("*"),count("email").over(w).alias("all_prev_mail_advert"), 

      sum("sold").over(w).alias("all_prev_sold_mail_advert")

  ).orderBy("email","creation_date").show()

Output:

+---------+--------------------+------------+----+--------------------+-------------------------+

|id_advert|       creation_date|       email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|

+---------+--------------------+------------+----+--------------------+-------------------------+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|                   0|                     null|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|                   1|                        1|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|                   0|                     null|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|                   1|                        1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|                   1|                        1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|                   0|                     null|

|        8|2014-08-31 17:20:...|patissier.fr|   1|                   0|                     null|

|        2|2016-07-05 10:48:...|patissier.fr|   1|                   0|                     null|

|        4|2017-08-21 17:31:...|patissier.fr|   0|                   0|                     null|

|        6|2017-08-23 17:03:...|patissier.fr|   0|                   1|                        0|

+---------+--------------------+------------+----+--------------------+-------------------------+

Explanation:

We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.

And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.

answered Nov 19 at 18:01

Biswanath

5,106103556

add a comment |

up vote
1
down vote

accepted

Adding this as different answer because it is different

Input:

df.select("*").orderBy("email","creation_date").show()



+---------+--------------------+------------+----+

|id_advert|       creation_date|       email|sold|

+---------+--------------------+------------+----+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|

|        8|2014-08-31 17:20:...|patissier.fr|   1|

|        2|2016-07-05 10:48:...|patissier.fr|   1|

|        4|2017-08-21 17:31:...|patissier.fr|   0|

|        6|2017-08-23 17:03:...|patissier.fr|   0|

+---------+--------------------+------------+----+

Now you define your window spec as something like this

val w = Window.

          partitionBy("email").

          orderBy(col("creation_date"). 

          cast("timestamp").

          cast("long")).rangeBetween(-60*24*60*60,-1)

And the main query will be:

df.

  select(

      col("*"),count("email").over(w).alias("all_prev_mail_advert"), 

      sum("sold").over(w).alias("all_prev_sold_mail_advert")

  ).orderBy("email","creation_date").show()

Output:

+---------+--------------------+------------+----+--------------------+-------------------------+

|id_advert|       creation_date|       email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|

+---------+--------------------+------------+----+--------------------+-------------------------+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|                   0|                     null|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|                   1|                        1|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|                   0|                     null|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|                   1|                        1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|                   1|                        1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|                   0|                     null|

|        8|2014-08-31 17:20:...|patissier.fr|   1|                   0|                     null|

|        2|2016-07-05 10:48:...|patissier.fr|   1|                   0|                     null|

|        4|2017-08-21 17:31:...|patissier.fr|   0|                   0|                     null|

|        6|2017-08-23 17:03:...|patissier.fr|   0|                   1|                        0|

+---------+--------------------+------------+----+--------------------+-------------------------+

Explanation:

We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.

And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.

answered Nov 19 at 18:01

Biswanath

5,106103556

add a comment |

up vote
1
down vote

accepted

Adding this as different answer because it is different

Input:

df.select("*").orderBy("email","creation_date").show()



+---------+--------------------+------------+----+

|id_advert|       creation_date|       email|sold|

+---------+--------------------+------------+----+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|

|        8|2014-08-31 17:20:...|patissier.fr|   1|

|        2|2016-07-05 10:48:...|patissier.fr|   1|

|        4|2017-08-21 17:31:...|patissier.fr|   0|

|        6|2017-08-23 17:03:...|patissier.fr|   0|

+---------+--------------------+------------+----+

Now you define your window spec as something like this

val w = Window.

          partitionBy("email").

          orderBy(col("creation_date"). 

          cast("timestamp").

          cast("long")).rangeBetween(-60*24*60*60,-1)

And the main query will be:

df.

  select(

      col("*"),count("email").over(w).alias("all_prev_mail_advert"), 

      sum("sold").over(w).alias("all_prev_sold_mail_advert")

  ).orderBy("email","creation_date").show()

Output:

+---------+--------------------+------------+----+--------------------+-------------------------+

|id_advert|       creation_date|       email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|

+---------+--------------------+------------+----+--------------------+-------------------------+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|                   0|                     null|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|                   1|                        1|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|                   0|                     null|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|                   1|                        1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|                   1|                        1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|                   0|                     null|

|        8|2014-08-31 17:20:...|patissier.fr|   1|                   0|                     null|

|        2|2016-07-05 10:48:...|patissier.fr|   1|                   0|                     null|

|        4|2017-08-21 17:31:...|patissier.fr|   0|                   0|                     null|

|        6|2017-08-23 17:03:...|patissier.fr|   0|                   1|                        0|

+---------+--------------------+------------+----+--------------------+-------------------------+

Explanation:

We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.

And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.

answered Nov 19 at 18:01

Biswanath

5,106103556

Adding this as different answer because it is different

Input:

df.select("*").orderBy("email","creation_date").show()



+---------+--------------------+------------+----+

|id_advert|       creation_date|       email|sold|

+---------+--------------------+------------+----+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|

|        8|2014-08-31 17:20:...|patissier.fr|   1|

|        2|2016-07-05 10:48:...|patissier.fr|   1|

|        4|2017-08-21 17:31:...|patissier.fr|   0|

|        6|2017-08-23 17:03:...|patissier.fr|   0|

+---------+--------------------+------------+----+

Now you define your window spec as something like this

val w = Window.

          partitionBy("email").

          orderBy(col("creation_date"). 

          cast("timestamp").

          cast("long")).rangeBetween(-60*24*60*60,-1)

And the main query will be:

df.

  select(

      col("*"),count("email").over(w).alias("all_prev_mail_advert"), 

      sum("sold").over(w).alias("all_prev_sold_mail_advert")

  ).orderBy("email","creation_date").show()

Output:

+---------+--------------------+------------+----+--------------------+-------------------------+

|id_advert|       creation_date|       email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|

+---------+--------------------+------------+----+--------------------+-------------------------+

|        1|2015-06-29 10:53:...|boulanger.fr|   1|                   0|                     null|

|        5|2015-07-28 11:22:...|boulanger.fr|   0|                   1|                        1|

|        3|2017-06-28 10:31:...|boulanger.fr|   1|                   0|                     null|

|        7|2017-08-24 16:08:...|boulanger.fr|   1|                   1|                        1|

|        9|2017-09-04 14:35:...|boulanger.fr|   1|                   1|                        1|

|       10|2012-09-07 15:10:...|patissier.fr|   0|                   0|                     null|

|        8|2014-08-31 17:20:...|patissier.fr|   1|                   0|                     null|

|        2|2016-07-05 10:48:...|patissier.fr|   1|                   0|                     null|

|        4|2017-08-21 17:31:...|patissier.fr|   0|                   0|                     null|

|        6|2017-08-23 17:03:...|patissier.fr|   0|                   1|                        0|

+---------+--------------------+------------+----+--------------------+-------------------------+

Explanation:

We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.

And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.

answered Nov 19 at 18:01

Biswanath

5,106103556

answered Nov 19 at 18:01

Biswanath

5,106103556

answered Nov 19 at 18:01

Biswanath

5,106103556

answered Nov 19 at 18:01

Biswanath

5,106103556

add a comment |

up vote
1
down vote

Here is the answer with using Window with a range

Create a window spec with range between current and past sixty days

val w = Window

          .partitionBy(col("email"))

          .orderBy(col("creation_date").cast("timestamp").cast("long"))

          .rangeBetween(-60*86400,-1)

Then select it over your data frame

df

 .select(col("*"),count("email").over(w).alias("trailing_count"))

 .orderBy("email","creation_date") //using this for display purpose 

 .show()

Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.

Input Data :

df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()



+---------+--------------------+------------+

|id_advert|       creation_date|       email|

+---------+--------------------+------------+

|        3|2017-06-28 10:31:...|boulanger.fr|

|        1|2017-06-29 10:53:...|boulanger.fr|

|        5|2017-07-28 11:22:...|boulanger.fr|

|        7|2017-08-24 16:08:...|boulanger.fr|

|        9|2017-09-04 14:35:...|boulanger.fr|

|        2|2017-07-05 10:48:...|patissier.fr|

|        4|2017-08-21 17:31:...|patissier.fr|

|        6|2017-08-23 17:03:...|patissier.fr|

|        8|2017-08-31 17:20:...|patissier.fr|

|       10|2017-09-07 15:10:...|patissier.fr|

+---------+--------------------+------------+

Output:

+---------+--------------------+------------+-------------+--------------+

|id_advert|       creation_date|       email|date_creation|trailing_count|

+---------+--------------------+------------+-------------+--------------+

|        3|2017-06-28 10:31:...|boulanger.fr|   1498645902|             0|

|        1|2017-06-29 10:53:...|boulanger.fr|   1498733633|             1|

|        5|2017-07-28 11:22:...|boulanger.fr|   1501240962|             2|

|        7|2017-08-24 16:08:...|boulanger.fr|   1503590887|             3|

|        9|2017-09-04 14:35:...|boulanger.fr|   1504535738|             2|

|        2|2017-07-05 10:48:...|patissier.fr|   1499251737|             0|

|        4|2017-08-21 17:31:...|patissier.fr|   1503336672|             1|

|        6|2017-08-23 17:03:...|patissier.fr|   1503507823|             2|

|        8|2017-08-31 17:20:...|patissier.fr|   1504200043|             3|

|       10|2017-09-07 15:10:...|patissier.fr|   1504797034|             3|

+---------+--------------------+------------+-------------+--------------+

answered Nov 18 at 18:45

Biswanath

5,106103556

Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12

Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26

It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54

add a comment |

up vote
1
down vote

Here is the answer with using Window with a range

Create a window spec with range between current and past sixty days

val w = Window

          .partitionBy(col("email"))

          .orderBy(col("creation_date").cast("timestamp").cast("long"))

          .rangeBetween(-60*86400,-1)

Then select it over your data frame

df

 .select(col("*"),count("email").over(w).alias("trailing_count"))

 .orderBy("email","creation_date") //using this for display purpose 

 .show()

Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.

Input Data :

df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()



+---------+--------------------+------------+

|id_advert|       creation_date|       email|

+---------+--------------------+------------+

|        3|2017-06-28 10:31:...|boulanger.fr|

|        1|2017-06-29 10:53:...|boulanger.fr|

|        5|2017-07-28 11:22:...|boulanger.fr|

|        7|2017-08-24 16:08:...|boulanger.fr|

|        9|2017-09-04 14:35:...|boulanger.fr|

|        2|2017-07-05 10:48:...|patissier.fr|

|        4|2017-08-21 17:31:...|patissier.fr|

|        6|2017-08-23 17:03:...|patissier.fr|

|        8|2017-08-31 17:20:...|patissier.fr|

|       10|2017-09-07 15:10:...|patissier.fr|

+---------+--------------------+------------+

Output:

+---------+--------------------+------------+-------------+--------------+

|id_advert|       creation_date|       email|date_creation|trailing_count|

+---------+--------------------+------------+-------------+--------------+

|        3|2017-06-28 10:31:...|boulanger.fr|   1498645902|             0|

|        1|2017-06-29 10:53:...|boulanger.fr|   1498733633|             1|

|        5|2017-07-28 11:22:...|boulanger.fr|   1501240962|             2|

|        7|2017-08-24 16:08:...|boulanger.fr|   1503590887|             3|

|        9|2017-09-04 14:35:...|boulanger.fr|   1504535738|             2|

|        2|2017-07-05 10:48:...|patissier.fr|   1499251737|             0|

|        4|2017-08-21 17:31:...|patissier.fr|   1503336672|             1|

|        6|2017-08-23 17:03:...|patissier.fr|   1503507823|             2|

|        8|2017-08-31 17:20:...|patissier.fr|   1504200043|             3|

|       10|2017-09-07 15:10:...|patissier.fr|   1504797034|             3|

+---------+--------------------+------------+-------------+--------------+

answered Nov 18 at 18:45

Biswanath

5,106103556

Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12

Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26

It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54

add a comment |

up vote
1
down vote

Here is the answer with using Window with a range

Create a window spec with range between current and past sixty days

val w = Window

          .partitionBy(col("email"))

          .orderBy(col("creation_date").cast("timestamp").cast("long"))

          .rangeBetween(-60*86400,-1)

Then select it over your data frame

df

 .select(col("*"),count("email").over(w).alias("trailing_count"))

 .orderBy("email","creation_date") //using this for display purpose 

 .show()

Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.

Input Data :

df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()



+---------+--------------------+------------+

|id_advert|       creation_date|       email|

+---------+--------------------+------------+

|        3|2017-06-28 10:31:...|boulanger.fr|

|        1|2017-06-29 10:53:...|boulanger.fr|

|        5|2017-07-28 11:22:...|boulanger.fr|

|        7|2017-08-24 16:08:...|boulanger.fr|

|        9|2017-09-04 14:35:...|boulanger.fr|

|        2|2017-07-05 10:48:...|patissier.fr|

|        4|2017-08-21 17:31:...|patissier.fr|

|        6|2017-08-23 17:03:...|patissier.fr|

|        8|2017-08-31 17:20:...|patissier.fr|

|       10|2017-09-07 15:10:...|patissier.fr|

+---------+--------------------+------------+

Output:

+---------+--------------------+------------+-------------+--------------+

|id_advert|       creation_date|       email|date_creation|trailing_count|

+---------+--------------------+------------+-------------+--------------+

|        3|2017-06-28 10:31:...|boulanger.fr|   1498645902|             0|

|        1|2017-06-29 10:53:...|boulanger.fr|   1498733633|             1|

|        5|2017-07-28 11:22:...|boulanger.fr|   1501240962|             2|

|        7|2017-08-24 16:08:...|boulanger.fr|   1503590887|             3|

|        9|2017-09-04 14:35:...|boulanger.fr|   1504535738|             2|

|        2|2017-07-05 10:48:...|patissier.fr|   1499251737|             0|

|        4|2017-08-21 17:31:...|patissier.fr|   1503336672|             1|

|        6|2017-08-23 17:03:...|patissier.fr|   1503507823|             2|

|        8|2017-08-31 17:20:...|patissier.fr|   1504200043|             3|

|       10|2017-09-07 15:10:...|patissier.fr|   1504797034|             3|

+---------+--------------------+------------+-------------+--------------+

answered Nov 18 at 18:45

Biswanath

5,106103556

Here is the answer with using Window with a range

Create a window spec with range between current and past sixty days

val w = Window

          .partitionBy(col("email"))

          .orderBy(col("creation_date").cast("timestamp").cast("long"))

          .rangeBetween(-60*86400,-1)

Then select it over your data frame

df

 .select(col("*"),count("email").over(w).alias("trailing_count"))

 .orderBy("email","creation_date") //using this for display purpose 

 .show()

Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.

Input Data :

df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()



+---------+--------------------+------------+

|id_advert|       creation_date|       email|

+---------+--------------------+------------+

|        3|2017-06-28 10:31:...|boulanger.fr|

|        1|2017-06-29 10:53:...|boulanger.fr|

|        5|2017-07-28 11:22:...|boulanger.fr|

|        7|2017-08-24 16:08:...|boulanger.fr|

|        9|2017-09-04 14:35:...|boulanger.fr|

|        2|2017-07-05 10:48:...|patissier.fr|

|        4|2017-08-21 17:31:...|patissier.fr|

|        6|2017-08-23 17:03:...|patissier.fr|

|        8|2017-08-31 17:20:...|patissier.fr|

|       10|2017-09-07 15:10:...|patissier.fr|

+---------+--------------------+------------+

Output:

+---------+--------------------+------------+-------------+--------------+

|id_advert|       creation_date|       email|date_creation|trailing_count|

+---------+--------------------+------------+-------------+--------------+

|        3|2017-06-28 10:31:...|boulanger.fr|   1498645902|             0|

|        1|2017-06-29 10:53:...|boulanger.fr|   1498733633|             1|

|        5|2017-07-28 11:22:...|boulanger.fr|   1501240962|             2|

|        7|2017-08-24 16:08:...|boulanger.fr|   1503590887|             3|

|        9|2017-09-04 14:35:...|boulanger.fr|   1504535738|             2|

|        2|2017-07-05 10:48:...|patissier.fr|   1499251737|             0|

|        4|2017-08-21 17:31:...|patissier.fr|   1503336672|             1|

|        6|2017-08-23 17:03:...|patissier.fr|   1503507823|             2|

|        8|2017-08-31 17:20:...|patissier.fr|   1504200043|             3|

|       10|2017-09-07 15:10:...|patissier.fr|   1504797034|             3|

+---------+--------------------+------------+-------------+--------------+

answered Nov 18 at 18:45

Biswanath

5,106103556

answered Nov 18 at 18:45

Biswanath

5,106103556

answered Nov 18 at 18:45

Biswanath

5,106103556

answered Nov 18 at 18:45

Biswanath

5,106103556

Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12

Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26

It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54

add a comment |

Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12

Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26

It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54

Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12

Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26

It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54

add a comment |

up vote
0
down vote

As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.

I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...

How does it work? To me:

if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...

What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.

Is it an easy way to use what you did and apply it to my real issue?

My dataframe look like this:

var df = sc.parallelize(Array(

(1,  "2015-06-29 10:53:53.0","boulanger.fr", 1),

(2,  "2016-07-05 10:48:57.0","patissier.fr", 1),

(3,  "2017-06-28 10:31:42.0","boulanger.fr", 1),

(4,  "2017-08-21 17:31:12.0","patissier.fr", 0),

(5,  "2015-07-28 11:22:42.0","boulanger.fr", 0),

(6,  "2017-08-23 17:03:43.0","patissier.fr", 0),

(7,  "2017-08-24 16:08:07.0","boulanger.fr", 1),

(8,  "2014-08-31 17:20:43.0","patissier.fr", 1),

(9,  "2017-09-04 14:35:38.0","boulanger.fr", 1),

(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")

For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...

Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).

answered Nov 19 at 15:53

Anneso

769

So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20

Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29

add a comment |

up vote
0
down vote

As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.

I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...

How does it work? To me:

if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...

What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.

Is it an easy way to use what you did and apply it to my real issue?

My dataframe look like this:

var df = sc.parallelize(Array(

(1,  "2015-06-29 10:53:53.0","boulanger.fr", 1),

(2,  "2016-07-05 10:48:57.0","patissier.fr", 1),

(3,  "2017-06-28 10:31:42.0","boulanger.fr", 1),

(4,  "2017-08-21 17:31:12.0","patissier.fr", 0),

(5,  "2015-07-28 11:22:42.0","boulanger.fr", 0),

(6,  "2017-08-23 17:03:43.0","patissier.fr", 0),

(7,  "2017-08-24 16:08:07.0","boulanger.fr", 1),

(8,  "2014-08-31 17:20:43.0","patissier.fr", 1),

(9,  "2017-09-04 14:35:38.0","boulanger.fr", 1),

(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")

For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...

Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).

answered Nov 19 at 15:53

Anneso

769

So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20

Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29

add a comment |

up vote
0
down vote

As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.

I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...

How does it work? To me:

if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...

What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.

Is it an easy way to use what you did and apply it to my real issue?

My dataframe look like this:

var df = sc.parallelize(Array(

(1,  "2015-06-29 10:53:53.0","boulanger.fr", 1),

(2,  "2016-07-05 10:48:57.0","patissier.fr", 1),

(3,  "2017-06-28 10:31:42.0","boulanger.fr", 1),

(4,  "2017-08-21 17:31:12.0","patissier.fr", 0),

(5,  "2015-07-28 11:22:42.0","boulanger.fr", 0),

(6,  "2017-08-23 17:03:43.0","patissier.fr", 0),

(7,  "2017-08-24 16:08:07.0","boulanger.fr", 1),

(8,  "2014-08-31 17:20:43.0","patissier.fr", 1),

(9,  "2017-09-04 14:35:38.0","boulanger.fr", 1),

(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")

For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...

Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).

answered Nov 19 at 15:53

Anneso

769

As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.

I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...

How does it work? To me:

if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???

if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...

What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.

Is it an easy way to use what you did and apply it to my real issue?

My dataframe look like this:

var df = sc.parallelize(Array(

(1,  "2015-06-29 10:53:53.0","boulanger.fr", 1),

(2,  "2016-07-05 10:48:57.0","patissier.fr", 1),

(3,  "2017-06-28 10:31:42.0","boulanger.fr", 1),

(4,  "2017-08-21 17:31:12.0","patissier.fr", 0),

(5,  "2015-07-28 11:22:42.0","boulanger.fr", 0),

(6,  "2017-08-23 17:03:43.0","patissier.fr", 0),

(7,  "2017-08-24 16:08:07.0","boulanger.fr", 1),

(8,  "2014-08-31 17:20:43.0","patissier.fr", 1),

(9,  "2017-09-04 14:35:38.0","boulanger.fr", 1),

(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")

For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...

Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).

answered Nov 19 at 15:53

Anneso

769

answered Nov 19 at 15:53

Anneso

769

answered Nov 19 at 15:53

Anneso

769

answered Nov 19 at 15:53

Anneso

769

So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20

Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29

add a comment |

So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20

Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29

So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20

Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Tukukkk