Use Window to count lines with if condition in scala
up vote
1
down vote
favorite
I hope that you might help me :-)
I have a dataframe with posted advert .
I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.
I created the dataframe below to explain things better:
var df = sc.parallelize(Array(
(1, "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"),
(2, "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"),
(3, "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"),
(4, "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"),
(5, "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"),
(6, "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"),
(7, "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"),
(8, "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"),
(9, "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"),
(10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",
"email", "date_minus1","date_minus2m")
df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))
- date_minus1 = the day before the advert was posted
- date_minus2m = 2 month before the advert was posted
I want to count the number of advert, with the same email, between those 2 dates...
What I want as a result is:
+---------+----------------+
|id_advert|nb_prev_advert |
+---------+----------------+
|6 |2 |
|3 |3 |
|5 |3 |
|9 |2 |
|4 |1 |
|8 |3 |
|7 |3 |
|10 |3 |
+--------+-----------------+
I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...
I am sur the we can do something like:
val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)
And use it to go across the dataframe and count only row with
- email of the row = email of the current_row
- date_minus2m of the row< date creation of the current row < date_minus1 of the row
scala apache-spark
add a comment |
up vote
1
down vote
favorite
I hope that you might help me :-)
I have a dataframe with posted advert .
I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.
I created the dataframe below to explain things better:
var df = sc.parallelize(Array(
(1, "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"),
(2, "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"),
(3, "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"),
(4, "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"),
(5, "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"),
(6, "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"),
(7, "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"),
(8, "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"),
(9, "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"),
(10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",
"email", "date_minus1","date_minus2m")
df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))
- date_minus1 = the day before the advert was posted
- date_minus2m = 2 month before the advert was posted
I want to count the number of advert, with the same email, between those 2 dates...
What I want as a result is:
+---------+----------------+
|id_advert|nb_prev_advert |
+---------+----------------+
|6 |2 |
|3 |3 |
|5 |3 |
|9 |2 |
|4 |1 |
|8 |3 |
|7 |3 |
|10 |3 |
+--------+-----------------+
I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...
I am sur the we can do something like:
val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)
And use it to go across the dataframe and count only row with
- email of the row = email of the current_row
- date_minus2m of the row< date creation of the current row < date_minus1 of the row
scala apache-spark
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I hope that you might help me :-)
I have a dataframe with posted advert .
I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.
I created the dataframe below to explain things better:
var df = sc.parallelize(Array(
(1, "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"),
(2, "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"),
(3, "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"),
(4, "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"),
(5, "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"),
(6, "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"),
(7, "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"),
(8, "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"),
(9, "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"),
(10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",
"email", "date_minus1","date_minus2m")
df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))
- date_minus1 = the day before the advert was posted
- date_minus2m = 2 month before the advert was posted
I want to count the number of advert, with the same email, between those 2 dates...
What I want as a result is:
+---------+----------------+
|id_advert|nb_prev_advert |
+---------+----------------+
|6 |2 |
|3 |3 |
|5 |3 |
|9 |2 |
|4 |1 |
|8 |3 |
|7 |3 |
|10 |3 |
+--------+-----------------+
I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...
I am sur the we can do something like:
val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)
And use it to go across the dataframe and count only row with
- email of the row = email of the current_row
- date_minus2m of the row< date creation of the current row < date_minus1 of the row
scala apache-spark
I hope that you might help me :-)
I have a dataframe with posted advert .
I want, for each id of advert to count the number of advert posted in the 2 month preceding this one, by the same email.
I created the dataframe below to explain things better:
var df = sc.parallelize(Array(
(1, "2017-06-29 10:53:53.0","boulanger.fr" ,"2017-06-28","2017-04-29"),
(2, "2017-07-05 10:48:57.0","patissier.fr","2017-07-04","2017-05-05"),
(3, "2017-06-28 10:31:42.0","boulanger.fr" ,"2017-08-16","2017-06-17"),
(4, "2017-08-21 17:31:12.0","patissier.fr","2017-08-20","2017-06-21"),
(5, "2017-07-28 11:22:42.0","boulanger.fr" ,"2017-08-22","2017-06-23"),
(6, "2017-08-23 17:03:43.0","patissier.fr","2017-08-22","2017-06-23"),
(7, "2017-08-24 16:08:07.0","boulanger.fr" ,"2017-08-23","2017-06-24"),
(8, "2017-08-31 17:20:43.0","patissier.fr","2017-08-30","2017-06-30"),
(9, "2017-09-04 14:35:38.0","boulanger.fr" ,"2017-09-03","2017-07-04"),
(10, "2017-09-07 15:10:34.0","patissier.fr","2017-09-06","2017-07-07"))).toDF("id_advert", "creation_date",
"email", "date_minus1","date_minus2m")
df = df.withColumn("date_minus1", to_date(unix_timestamp($"date_minus1", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_minus2", to_date(unix_timestamp($"date_minus2", "yyyy-MM-dd").cast("timestamp")))
df = df.withColumn("date_crecreation", (unix_timestamp($"creation_date", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))
- date_minus1 = the day before the advert was posted
- date_minus2m = 2 month before the advert was posted
I want to count the number of advert, with the same email, between those 2 dates...
What I want as a result is:
+---------+----------------+
|id_advert|nb_prev_advert |
+---------+----------------+
|6 |2 |
|3 |3 |
|5 |3 |
|9 |2 |
|4 |1 |
|8 |3 |
|7 |3 |
|10 |3 |
+--------+-----------------+
I manage to do that with an awfull join from the dataframe by itself but as I have millions of lines it took almost 2 hours to run...
I am sur the we can do something like:
val w = Window.partitionBy("id_advert").orderBy("creation_date").rowsBetween(-50000000, -1)
And use it to go across the dataframe and count only row with
- email of the row = email of the current_row
- date_minus2m of the row< date creation of the current row < date_minus1 of the row
scala apache-spark
scala apache-spark
edited Nov 16 at 8:35
asked Nov 16 at 7:58
Anneso
769
769
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
up vote
1
down vote
accepted
Adding this as different answer because it is different
Input:
df.select("*").orderBy("email","creation_date").show()
+---------+--------------------+------------+----+
|id_advert| creation_date| email|sold|
+---------+--------------------+------------+----+
| 1|2015-06-29 10:53:...|boulanger.fr| 1|
| 5|2015-07-28 11:22:...|boulanger.fr| 0|
| 3|2017-06-28 10:31:...|boulanger.fr| 1|
| 7|2017-08-24 16:08:...|boulanger.fr| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0|
| 8|2014-08-31 17:20:...|patissier.fr| 1|
| 2|2016-07-05 10:48:...|patissier.fr| 1|
| 4|2017-08-21 17:31:...|patissier.fr| 0|
| 6|2017-08-23 17:03:...|patissier.fr| 0|
+---------+--------------------+------------+----+
Now you define your window spec as something like this
val w = Window.
partitionBy("email").
orderBy(col("creation_date").
cast("timestamp").
cast("long")).rangeBetween(-60*24*60*60,-1)
And the main query will be:
df.
select(
col("*"),count("email").over(w).alias("all_prev_mail_advert"),
sum("sold").over(w).alias("all_prev_sold_mail_advert")
).orderBy("email","creation_date").show()
Output:
+---------+--------------------+------------+----+--------------------+-------------------------+
|id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
+---------+--------------------+------------+----+--------------------+-------------------------+
| 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
| 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
| 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
| 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
| 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
| 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
| 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
| 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
+---------+--------------------+------------+----+--------------------+-------------------------+
Explanation:
We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.
And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.
add a comment |
up vote
1
down vote
Here is the answer with using Window with a range
Create a window spec with range between current and past sixty days
val w = Window
.partitionBy(col("email"))
.orderBy(col("creation_date").cast("timestamp").cast("long"))
.rangeBetween(-60*86400,-1)
Then select it over your data frame
df
.select(col("*"),count("email").over(w).alias("trailing_count"))
.orderBy("email","creation_date") //using this for display purpose
.show()
Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.
Input Data :
df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()
+---------+--------------------+------------+
|id_advert| creation_date| email|
+---------+--------------------+------------+
| 3|2017-06-28 10:31:...|boulanger.fr|
| 1|2017-06-29 10:53:...|boulanger.fr|
| 5|2017-07-28 11:22:...|boulanger.fr|
| 7|2017-08-24 16:08:...|boulanger.fr|
| 9|2017-09-04 14:35:...|boulanger.fr|
| 2|2017-07-05 10:48:...|patissier.fr|
| 4|2017-08-21 17:31:...|patissier.fr|
| 6|2017-08-23 17:03:...|patissier.fr|
| 8|2017-08-31 17:20:...|patissier.fr|
| 10|2017-09-07 15:10:...|patissier.fr|
+---------+--------------------+------------+
Output:
+---------+--------------------+------------+-------------+--------------+
|id_advert| creation_date| email|date_creation|trailing_count|
+---------+--------------------+------------+-------------+--------------+
| 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
| 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
| 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
| 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
| 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
| 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
| 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
| 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
| 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
| 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
+---------+--------------------+------------+-------------+--------------+
Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12
Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26
It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54
add a comment |
up vote
0
down vote
As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.
I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...
How does it work? To me:
- if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???
- if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...
What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.
Is it an easy way to use what you did and apply it to my real issue?
My dataframe look like this:
var df = sc.parallelize(Array(
(1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
(2, "2016-07-05 10:48:57.0","patissier.fr", 1),
(3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
(4, "2017-08-21 17:31:12.0","patissier.fr", 0),
(5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
(6, "2017-08-23 17:03:43.0","patissier.fr", 0),
(7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
(8, "2014-08-31 17:20:43.0","patissier.fr", 1),
(9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")
For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...
Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).
So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20
Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29
add a comment |
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
Adding this as different answer because it is different
Input:
df.select("*").orderBy("email","creation_date").show()
+---------+--------------------+------------+----+
|id_advert| creation_date| email|sold|
+---------+--------------------+------------+----+
| 1|2015-06-29 10:53:...|boulanger.fr| 1|
| 5|2015-07-28 11:22:...|boulanger.fr| 0|
| 3|2017-06-28 10:31:...|boulanger.fr| 1|
| 7|2017-08-24 16:08:...|boulanger.fr| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0|
| 8|2014-08-31 17:20:...|patissier.fr| 1|
| 2|2016-07-05 10:48:...|patissier.fr| 1|
| 4|2017-08-21 17:31:...|patissier.fr| 0|
| 6|2017-08-23 17:03:...|patissier.fr| 0|
+---------+--------------------+------------+----+
Now you define your window spec as something like this
val w = Window.
partitionBy("email").
orderBy(col("creation_date").
cast("timestamp").
cast("long")).rangeBetween(-60*24*60*60,-1)
And the main query will be:
df.
select(
col("*"),count("email").over(w).alias("all_prev_mail_advert"),
sum("sold").over(w).alias("all_prev_sold_mail_advert")
).orderBy("email","creation_date").show()
Output:
+---------+--------------------+------------+----+--------------------+-------------------------+
|id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
+---------+--------------------+------------+----+--------------------+-------------------------+
| 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
| 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
| 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
| 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
| 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
| 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
| 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
| 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
+---------+--------------------+------------+----+--------------------+-------------------------+
Explanation:
We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.
And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.
add a comment |
up vote
1
down vote
accepted
Adding this as different answer because it is different
Input:
df.select("*").orderBy("email","creation_date").show()
+---------+--------------------+------------+----+
|id_advert| creation_date| email|sold|
+---------+--------------------+------------+----+
| 1|2015-06-29 10:53:...|boulanger.fr| 1|
| 5|2015-07-28 11:22:...|boulanger.fr| 0|
| 3|2017-06-28 10:31:...|boulanger.fr| 1|
| 7|2017-08-24 16:08:...|boulanger.fr| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0|
| 8|2014-08-31 17:20:...|patissier.fr| 1|
| 2|2016-07-05 10:48:...|patissier.fr| 1|
| 4|2017-08-21 17:31:...|patissier.fr| 0|
| 6|2017-08-23 17:03:...|patissier.fr| 0|
+---------+--------------------+------------+----+
Now you define your window spec as something like this
val w = Window.
partitionBy("email").
orderBy(col("creation_date").
cast("timestamp").
cast("long")).rangeBetween(-60*24*60*60,-1)
And the main query will be:
df.
select(
col("*"),count("email").over(w).alias("all_prev_mail_advert"),
sum("sold").over(w).alias("all_prev_sold_mail_advert")
).orderBy("email","creation_date").show()
Output:
+---------+--------------------+------------+----+--------------------+-------------------------+
|id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
+---------+--------------------+------------+----+--------------------+-------------------------+
| 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
| 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
| 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
| 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
| 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
| 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
| 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
| 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
+---------+--------------------+------------+----+--------------------+-------------------------+
Explanation:
We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.
And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.
add a comment |
up vote
1
down vote
accepted
up vote
1
down vote
accepted
Adding this as different answer because it is different
Input:
df.select("*").orderBy("email","creation_date").show()
+---------+--------------------+------------+----+
|id_advert| creation_date| email|sold|
+---------+--------------------+------------+----+
| 1|2015-06-29 10:53:...|boulanger.fr| 1|
| 5|2015-07-28 11:22:...|boulanger.fr| 0|
| 3|2017-06-28 10:31:...|boulanger.fr| 1|
| 7|2017-08-24 16:08:...|boulanger.fr| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0|
| 8|2014-08-31 17:20:...|patissier.fr| 1|
| 2|2016-07-05 10:48:...|patissier.fr| 1|
| 4|2017-08-21 17:31:...|patissier.fr| 0|
| 6|2017-08-23 17:03:...|patissier.fr| 0|
+---------+--------------------+------------+----+
Now you define your window spec as something like this
val w = Window.
partitionBy("email").
orderBy(col("creation_date").
cast("timestamp").
cast("long")).rangeBetween(-60*24*60*60,-1)
And the main query will be:
df.
select(
col("*"),count("email").over(w).alias("all_prev_mail_advert"),
sum("sold").over(w).alias("all_prev_sold_mail_advert")
).orderBy("email","creation_date").show()
Output:
+---------+--------------------+------------+----+--------------------+-------------------------+
|id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
+---------+--------------------+------------+----+--------------------+-------------------------+
| 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
| 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
| 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
| 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
| 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
| 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
| 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
| 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
+---------+--------------------+------------+----+--------------------+-------------------------+
Explanation:
We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.
And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.
Adding this as different answer because it is different
Input:
df.select("*").orderBy("email","creation_date").show()
+---------+--------------------+------------+----+
|id_advert| creation_date| email|sold|
+---------+--------------------+------------+----+
| 1|2015-06-29 10:53:...|boulanger.fr| 1|
| 5|2015-07-28 11:22:...|boulanger.fr| 0|
| 3|2017-06-28 10:31:...|boulanger.fr| 1|
| 7|2017-08-24 16:08:...|boulanger.fr| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0|
| 8|2014-08-31 17:20:...|patissier.fr| 1|
| 2|2016-07-05 10:48:...|patissier.fr| 1|
| 4|2017-08-21 17:31:...|patissier.fr| 0|
| 6|2017-08-23 17:03:...|patissier.fr| 0|
+---------+--------------------+------------+----+
Now you define your window spec as something like this
val w = Window.
partitionBy("email").
orderBy(col("creation_date").
cast("timestamp").
cast("long")).rangeBetween(-60*24*60*60,-1)
And the main query will be:
df.
select(
col("*"),count("email").over(w).alias("all_prev_mail_advert"),
sum("sold").over(w).alias("all_prev_sold_mail_advert")
).orderBy("email","creation_date").show()
Output:
+---------+--------------------+------------+----+--------------------+-------------------------+
|id_advert| creation_date| email|sold|all_prev_mail_advert|all_prev_sold_mail_advert|
+---------+--------------------+------------+----+--------------------+-------------------------+
| 1|2015-06-29 10:53:...|boulanger.fr| 1| 0| null|
| 5|2015-07-28 11:22:...|boulanger.fr| 0| 1| 1|
| 3|2017-06-28 10:31:...|boulanger.fr| 1| 0| null|
| 7|2017-08-24 16:08:...|boulanger.fr| 1| 1| 1|
| 9|2017-09-04 14:35:...|boulanger.fr| 1| 1| 1|
| 10|2012-09-07 15:10:...|patissier.fr| 0| 0| null|
| 8|2014-08-31 17:20:...|patissier.fr| 1| 0| null|
| 2|2016-07-05 10:48:...|patissier.fr| 1| 0| null|
| 4|2017-08-21 17:31:...|patissier.fr| 0| 0| null|
| 6|2017-08-23 17:03:...|patissier.fr| 0| 1| 0|
+---------+--------------------+------------+----+--------------------+-------------------------+
Explanation:
We are defining a window function for the last two months partitioned by email. And the count over this window gives all the previous advert for the same email.
And to get all the previous sold advert we are simply adding the sold column over the same window. As sold is 1 for sold item, the sum gives the count of all the sold item over this window.
answered Nov 19 at 18:01
Biswanath
5,106103556
5,106103556
add a comment |
add a comment |
up vote
1
down vote
Here is the answer with using Window with a range
Create a window spec with range between current and past sixty days
val w = Window
.partitionBy(col("email"))
.orderBy(col("creation_date").cast("timestamp").cast("long"))
.rangeBetween(-60*86400,-1)
Then select it over your data frame
df
.select(col("*"),count("email").over(w).alias("trailing_count"))
.orderBy("email","creation_date") //using this for display purpose
.show()
Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.
Input Data :
df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()
+---------+--------------------+------------+
|id_advert| creation_date| email|
+---------+--------------------+------------+
| 3|2017-06-28 10:31:...|boulanger.fr|
| 1|2017-06-29 10:53:...|boulanger.fr|
| 5|2017-07-28 11:22:...|boulanger.fr|
| 7|2017-08-24 16:08:...|boulanger.fr|
| 9|2017-09-04 14:35:...|boulanger.fr|
| 2|2017-07-05 10:48:...|patissier.fr|
| 4|2017-08-21 17:31:...|patissier.fr|
| 6|2017-08-23 17:03:...|patissier.fr|
| 8|2017-08-31 17:20:...|patissier.fr|
| 10|2017-09-07 15:10:...|patissier.fr|
+---------+--------------------+------------+
Output:
+---------+--------------------+------------+-------------+--------------+
|id_advert| creation_date| email|date_creation|trailing_count|
+---------+--------------------+------------+-------------+--------------+
| 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
| 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
| 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
| 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
| 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
| 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
| 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
| 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
| 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
| 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
+---------+--------------------+------------+-------------+--------------+
Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12
Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26
It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54
add a comment |
up vote
1
down vote
Here is the answer with using Window with a range
Create a window spec with range between current and past sixty days
val w = Window
.partitionBy(col("email"))
.orderBy(col("creation_date").cast("timestamp").cast("long"))
.rangeBetween(-60*86400,-1)
Then select it over your data frame
df
.select(col("*"),count("email").over(w).alias("trailing_count"))
.orderBy("email","creation_date") //using this for display purpose
.show()
Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.
Input Data :
df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()
+---------+--------------------+------------+
|id_advert| creation_date| email|
+---------+--------------------+------------+
| 3|2017-06-28 10:31:...|boulanger.fr|
| 1|2017-06-29 10:53:...|boulanger.fr|
| 5|2017-07-28 11:22:...|boulanger.fr|
| 7|2017-08-24 16:08:...|boulanger.fr|
| 9|2017-09-04 14:35:...|boulanger.fr|
| 2|2017-07-05 10:48:...|patissier.fr|
| 4|2017-08-21 17:31:...|patissier.fr|
| 6|2017-08-23 17:03:...|patissier.fr|
| 8|2017-08-31 17:20:...|patissier.fr|
| 10|2017-09-07 15:10:...|patissier.fr|
+---------+--------------------+------------+
Output:
+---------+--------------------+------------+-------------+--------------+
|id_advert| creation_date| email|date_creation|trailing_count|
+---------+--------------------+------------+-------------+--------------+
| 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
| 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
| 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
| 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
| 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
| 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
| 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
| 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
| 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
| 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
+---------+--------------------+------------+-------------+--------------+
Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12
Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26
It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54
add a comment |
up vote
1
down vote
up vote
1
down vote
Here is the answer with using Window with a range
Create a window spec with range between current and past sixty days
val w = Window
.partitionBy(col("email"))
.orderBy(col("creation_date").cast("timestamp").cast("long"))
.rangeBetween(-60*86400,-1)
Then select it over your data frame
df
.select(col("*"),count("email").over(w).alias("trailing_count"))
.orderBy("email","creation_date") //using this for display purpose
.show()
Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.
Input Data :
df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()
+---------+--------------------+------------+
|id_advert| creation_date| email|
+---------+--------------------+------------+
| 3|2017-06-28 10:31:...|boulanger.fr|
| 1|2017-06-29 10:53:...|boulanger.fr|
| 5|2017-07-28 11:22:...|boulanger.fr|
| 7|2017-08-24 16:08:...|boulanger.fr|
| 9|2017-09-04 14:35:...|boulanger.fr|
| 2|2017-07-05 10:48:...|patissier.fr|
| 4|2017-08-21 17:31:...|patissier.fr|
| 6|2017-08-23 17:03:...|patissier.fr|
| 8|2017-08-31 17:20:...|patissier.fr|
| 10|2017-09-07 15:10:...|patissier.fr|
+---------+--------------------+------------+
Output:
+---------+--------------------+------------+-------------+--------------+
|id_advert| creation_date| email|date_creation|trailing_count|
+---------+--------------------+------------+-------------+--------------+
| 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
| 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
| 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
| 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
| 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
| 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
| 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
| 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
| 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
| 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
+---------+--------------------+------------+-------------+--------------+
Here is the answer with using Window with a range
Create a window spec with range between current and past sixty days
val w = Window
.partitionBy(col("email"))
.orderBy(col("creation_date").cast("timestamp").cast("long"))
.rangeBetween(-60*86400,-1)
Then select it over your data frame
df
.select(col("*"),count("email").over(w).alias("trailing_count"))
.orderBy("email","creation_date") //using this for display purpose
.show()
Note: Your expected output might be wrong. One, there would be at least a zero for a advert because something must be starting row for a mail. Also, count for advertid 3 seems wrong.
Input Data :
df.select("id_advert","creation_date","email").orderBy("email", "creation_date").show()
+---------+--------------------+------------+
|id_advert| creation_date| email|
+---------+--------------------+------------+
| 3|2017-06-28 10:31:...|boulanger.fr|
| 1|2017-06-29 10:53:...|boulanger.fr|
| 5|2017-07-28 11:22:...|boulanger.fr|
| 7|2017-08-24 16:08:...|boulanger.fr|
| 9|2017-09-04 14:35:...|boulanger.fr|
| 2|2017-07-05 10:48:...|patissier.fr|
| 4|2017-08-21 17:31:...|patissier.fr|
| 6|2017-08-23 17:03:...|patissier.fr|
| 8|2017-08-31 17:20:...|patissier.fr|
| 10|2017-09-07 15:10:...|patissier.fr|
+---------+--------------------+------------+
Output:
+---------+--------------------+------------+-------------+--------------+
|id_advert| creation_date| email|date_creation|trailing_count|
+---------+--------------------+------------+-------------+--------------+
| 3|2017-06-28 10:31:...|boulanger.fr| 1498645902| 0|
| 1|2017-06-29 10:53:...|boulanger.fr| 1498733633| 1|
| 5|2017-07-28 11:22:...|boulanger.fr| 1501240962| 2|
| 7|2017-08-24 16:08:...|boulanger.fr| 1503590887| 3|
| 9|2017-09-04 14:35:...|boulanger.fr| 1504535738| 2|
| 2|2017-07-05 10:48:...|patissier.fr| 1499251737| 0|
| 4|2017-08-21 17:31:...|patissier.fr| 1503336672| 1|
| 6|2017-08-23 17:03:...|patissier.fr| 1503507823| 2|
| 8|2017-08-31 17:20:...|patissier.fr| 1504200043| 3|
| 10|2017-09-07 15:10:...|patissier.fr| 1504797034| 3|
+---------+--------------------+------------+-------------+--------------+
answered Nov 18 at 18:45
Biswanath
5,106103556
5,106103556
Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12
Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26
It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54
add a comment |
Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12
Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26
It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54
Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12
Not sure. What is the problem for you. If you are running in sbt-console/spark-shell, I will make sure that the whole val w= ...... is in a single line. As I assume, w is just getting evaluated to only "Window" part. If possible post your whole code.
– Biswanath
Nov 19 at 15:12
Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26
Thank you very much!!! It works! However I just do one modification to avoid an error: "df.withColumn("trailing_count", count("email") over w).show"
– Anneso
Nov 19 at 15:26
It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54
It was not easy to explain what I do not understand so I made another question (that I posted as an answer... sorry for that)
– Anneso
Nov 19 at 15:54
add a comment |
up vote
0
down vote
As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.
I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...
How does it work? To me:
- if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???
- if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...
What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.
Is it an easy way to use what you did and apply it to my real issue?
My dataframe look like this:
var df = sc.parallelize(Array(
(1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
(2, "2016-07-05 10:48:57.0","patissier.fr", 1),
(3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
(4, "2017-08-21 17:31:12.0","patissier.fr", 0),
(5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
(6, "2017-08-23 17:03:43.0","patissier.fr", 0),
(7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
(8, "2014-08-31 17:20:43.0","patissier.fr", 1),
(9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")
For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...
Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).
So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20
Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29
add a comment |
up vote
0
down vote
As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.
I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...
How does it work? To me:
- if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???
- if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...
What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.
Is it an easy way to use what you did and apply it to my real issue?
My dataframe look like this:
var df = sc.parallelize(Array(
(1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
(2, "2016-07-05 10:48:57.0","patissier.fr", 1),
(3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
(4, "2017-08-21 17:31:12.0","patissier.fr", 0),
(5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
(6, "2017-08-23 17:03:43.0","patissier.fr", 0),
(7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
(8, "2014-08-31 17:20:43.0","patissier.fr", 1),
(9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")
For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...
Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).
So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20
Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29
add a comment |
up vote
0
down vote
up vote
0
down vote
As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.
I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...
How does it work? To me:
- if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???
- if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...
What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.
Is it an easy way to use what you did and apply it to my real issue?
My dataframe look like this:
var df = sc.parallelize(Array(
(1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
(2, "2016-07-05 10:48:57.0","patissier.fr", 1),
(3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
(4, "2017-08-21 17:31:12.0","patissier.fr", 0),
(5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
(6, "2017-08-23 17:03:43.0","patissier.fr", 0),
(7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
(8, "2014-08-31 17:20:43.0","patissier.fr", 1),
(9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")
For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...
Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).
As it impossible to structure correctly a comment I will use the answer button but it is actually more a question than an answer.
I simplify the problem thinking that with your answer I might be able to do what I want to do but I am not sure to understand correclty your answer...
How does it work? To me:
- if I do .rangeBetween(-3,-1) I will use a window which look 3 line before the current line to one line before the current line. But here it seems that rangeBetween is refering to the orderby variable and not the number total of lines..???
- if I do "partitionBy(col("email"))" I should have one line by email but here i still get oneline by advert_id...
What I really want to do is count respectively, the number of sold item and the number of un-sold items in the 2 months preceding the advert post date, by the same email.
Is it an easy way to use what you did and apply it to my real issue?
My dataframe look like this:
var df = sc.parallelize(Array(
(1, "2015-06-29 10:53:53.0","boulanger.fr", 1),
(2, "2016-07-05 10:48:57.0","patissier.fr", 1),
(3, "2017-06-28 10:31:42.0","boulanger.fr", 1),
(4, "2017-08-21 17:31:12.0","patissier.fr", 0),
(5, "2015-07-28 11:22:42.0","boulanger.fr", 0),
(6, "2017-08-23 17:03:43.0","patissier.fr", 0),
(7, "2017-08-24 16:08:07.0","boulanger.fr", 1),
(8, "2014-08-31 17:20:43.0","patissier.fr", 1),
(9, "2017-09-04 14:35:38.0","boulanger.fr", 1),
(10, "2012-09-07 15:10:34.0","patissier.fr", 0))).toDF("id_advert", "creation_date","email", "sold")
For each id_advert I would like to have 2 lines. One for the number of sold items and one for the number of un-sold items...
Thank you in advance!!! If it is not possible for you to unswer I will do it in a more durty way ;-).
answered Nov 19 at 15:53
Anneso
769
769
So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20
Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29
add a comment |
So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20
Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29
So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20
So you want sold and unsold count rather than, count of advert by the same mailid ? Am I right. Also, would it work if both sold and unsold are part of the row i.e. come as two columns ?
– Biswanath
Nov 19 at 16:20
Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29
Yes that is exactly what I want. A dataframe with 3 columns (id_advert, sold, number_previous). I suppose that if you manage to gave me a result as 2 columns I should manage to get 2 row so yes it would be great!!! :-)
– Anneso
Nov 19 at 16:29
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53333635%2fuse-window-to-count-lines-with-if-condition-in-scala%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown