Dataframes join returns empty results in Spark Scala
I have four DataFrames in Spark Scala (Spark version 2.3, Spark-sql 2.11, Scala version 2.11.0):
ratingsDf
+-------+---+
|ratings| id|
+-------+---+
|      0|  1|
|      1|  2|
|      1|  3|
|      0|  4|
|      0|  5|
|      1|  6|
|      1|  7|
|      1|  8|
|      0|  9|
|      1| 10|
+-------+---+
GpredictionsDf
+-----------+---+
|gprediction| id|
+-----------+---+
|          0|  1|
|          1|  2|
|          1|  3|
|          1|  4|
|          1|  5|
|          1|  6|
|          1|  7|
|          1|  8|
|          0|  9|
|          1| 10|
+-----------+---+
RpredictionsDf
+-----------+---+
|rprediction| id|
+-----------+---+
|          0|  1|
|          1|  2|
|          1|  3|
|          1|  4|
|          1|  5|
|          1|  6|
|          1|  7|
|          1|  8|
|          1|  9|
|          1| 10|
+-----------+---+
LpredictionsDf
+-----------+---+
|lprediction| id|
+-----------+---+
|          0|  1|
|          1|  2|
|          1|  3|
|          0|  4|
|          1|  5|
|          1|  6|
|          1|  7|
|          1|  8|
|          0|  9|
|          1| 10|
+-----------+---+
I need to create a DataFrame by joining all four tables on the "id" column. I tried the two approaches below:
**Method 1:**
val ensembleDf = GpredictionsDf.join(rpredjoin, gpredjoin("id") === RpredictionsDf("id"))
.join(LpredictionsDf, LpredictionsDf("id") === RpredictionsDf("id"))
.join(ratingsDf, ratingsDf("id") === RpredictionsDf("id"))
.select("gprediction", "rprediction", "lprediction", "ratings")
**Method 2:**
ratingsDf.createOrReplaceTempView("ratingjoin");
GpredictionsDf.createOrReplaceTempView("gpredjoin")
RpredictionsDf.createOrReplaceTempView("rpredjoin")
LpredictionsDf.createOrReplaceTempView("lpredjoin")
val ensembleDf = sqlContext.sql("SELECT gprediction, rprediction, lprediction, ratings FROM gpredjoin, rpredjoin, lpredjoin, ratingjoin WHERE " +
"gpredjoin.id = rpredjoin.id AND rpredjoin.id = lpredjoin.id AND lpredjoin.id = ratingjoin.id");
However, in both cases the join fails and returns an empty result:
ensembleDf.show();
+-----------+-----------+-----------+-------+
|gprediction|rprediction|lprediction|ratings|
+-----------+-----------+-----------+-------+
+-----------+-----------+-----------+-------+
Any idea why this could be happening? What code changes do I need to make to fix this?
apache-spark apache-spark-sql
Could you please follow the instructions from How to make good reproducible Apache Spark Dataframe examples and include reproducible data and Spark version? Thanks.
– user10465355
Nov 25 '18 at 23:24
I have updated it accordingly
– Nick
Nov 25 '18 at 23:43
All of these, including rpredjoin and gpredjoin, are DataFrames only. There are no Hive tables here.
– Nick
Nov 26 '18 at 0:59
Your joins in Method 1 look correct except that temp views were being mixed with dataframes. Replacing
GpredictionsDf.join(rpredjoin, gpredjoin("id") === RpredictionsDf("id"))
with
GpredictionsDf.join(RpredictionsDf, GpredictionsDf("id") === RpredictionsDf("id"))
should fix the problem.
– Leo C
Nov 26 '18 at 1:46
I changed it to
val ensemble = GpredictionsDf.join(RpredictionsDf, GpredictionsDf("id") === RpredictionsDf("id")).join(LpredictionsDf, LpredictionsDf("id") === RpredictionsDf("id")).join(ratingsDf, ratingsDf("id") === RpredictionsDf("id")).select("gprediction", "rprediction", "lprediction", "ratings")
but it still shows an empty dataset.
– Nick
Nov 26 '18 at 3:01
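Since the join is still empty after the names were corrected, one thing worth checking is whether the "id" values in the frames actually overlap and have the same type. The following is a minimal sketch, not a diagnosis of the asker's data: it assumes a local SparkSession, rebuilds two of the frames by hand from the tables in the question, and introduces two hypothetical names for illustration (`unmatchedIds`, a helper built on a left anti join, and `shiftedDf`, a deliberately mismatched frame). It is written as a script for spark-shell (use `:paste`).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for a standalone run.
// Inside spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-diagnostics").getOrCreate()
import spark.implicits._

// Two of the frames from the question, rebuilt by hand.
val GpredictionsDf = Seq((0,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(0,9),(1,10)).toDF("gprediction", "id")
val RpredictionsDf = Seq((0,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(1,9),(1,10)).toDF("rprediction", "id")

// Hypothetical frame whose ids do not overlap with the others (11..20 instead of 1..10).
val shiftedDf = RpredictionsDf.withColumn("id", col("id") + 10)

// Hypothetical helper: ids present in `left` that have no partner in `right`.
def unmatchedIds(left: DataFrame, right: DataFrame): DataFrame =
  left.join(right, Seq("id"), "left_anti").select("id")

// 1. Do the id columns have the same type in both frames?
GpredictionsDf.printSchema()
RpredictionsDf.printSchema()

// 2. Are both frames non-empty at the point where they are joined?
println(s"G rows: ${GpredictionsDf.count()}, R rows: ${RpredictionsDf.count()}")

// 3. How many ids fail to find a match?
val okCount = unmatchedIds(GpredictionsDf, RpredictionsDf).count()  // 0: every id matches
val badCount = unmatchedIds(GpredictionsDf, shiftedDf).count()      // 10: no id matches
println(s"unmatched against RpredictionsDf: $okCount, against shiftedDf: $badCount")
```

If the anti join against the real frames reports every id as unmatched, the problem is in how the "id" columns were produced rather than in the join expression.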
asked Nov 25 '18 at 22:30 by Nick (edited Nov 25 '18 at 23:39)
1 Answer
scala> val ratingsDf = Seq((0,1),(1,2),(1,3),(0,4),(0,5),(1,6),(1,7),(1,8),(0,9),(1,10)).toDF("ratings","id")
scala> val GpredictionsDf = Seq((0,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(0,9),(1,10)).toDF("gprediction", "id")
scala> val RpredictionsDf = Seq((0,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(1,9),(1,10)).toDF("rprediction", "id")
scala> val LpredictionsDf = Seq((0,1),(1,2),(1,3),(0,4),(1,5),(1,6),(1,7),(1,8),(0,9),(1,10)).toDF("lprediction", "id")
scala> val ensembleDf = GpredictionsDf.join(RpredictionsDf, GpredictionsDf("id") === RpredictionsDf("id") ).join(LpredictionsDf, LpredictionsDf("id") === RpredictionsDf("id")).join(ratingsDf, ratingsDf("id") === RpredictionsDf("id")).select("gprediction", "rprediction", "lprediction", "ratings")
scala> ensembleDf.show
+-----------+-----------+-----------+-------+
|gprediction|rprediction|lprediction|ratings|
+-----------+-----------+-----------+-------+
|          0|          0|          0|      0|
|          1|          1|          1|      1|
|          1|          1|          1|      1|
|          1|          1|          0|      0|
|          1|          1|          1|      0|
|          1|          1|          1|      1|
|          1|          1|          1|      1|
|          1|          1|          1|      1|
|          0|          1|          0|      0|
|          1|          1|          1|      1|
+-----------+-----------+-----------+-------+
This is what I tried, and it gives the correct values. I recommend checking the DataFrames you are using in the join.
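The same hand-built data can also be run through the SQL route from Method 2 and through a join on the column name, to see whether either form behaves differently from the expression join. This is a sketch under assumptions: a local SparkSession, the sample rows from the tables in the question, explicit JOIN ... ON clauses in place of the comma-separated FROM list, and the names `viaUsing` and `viaSql`, which are chosen here for illustration. It is written as a script for spark-shell (use `:paste`).

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run.
// Inside spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-variants").getOrCreate()
import spark.implicits._

// The four frames, rebuilt by hand from the tables in the question.
val ratingsDf = Seq((0,1),(1,2),(1,3),(0,4),(0,5),(1,6),(1,7),(1,8),(0,9),(1,10)).toDF("ratings", "id")
val GpredictionsDf = Seq((0,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(0,9),(1,10)).toDF("gprediction", "id")
val RpredictionsDf = Seq((0,1),(1,2),(1,3),(1,4),(1,5),(1,6),(1,7),(1,8),(1,9),(1,10)).toDF("rprediction", "id")
val LpredictionsDf = Seq((0,1),(1,2),(1,3),(0,4),(1,5),(1,6),(1,7),(1,8),(0,9),(1,10)).toDF("lprediction", "id")

// Variant 1: join on the column name, so the result keeps a single "id" column.
val viaUsing = GpredictionsDf
  .join(RpredictionsDf, Seq("id"))
  .join(LpredictionsDf, Seq("id"))
  .join(ratingsDf, Seq("id"))
  .select("gprediction", "rprediction", "lprediction", "ratings")

// Variant 2: Method 2 from the question, with explicit JOIN ... ON clauses.
ratingsDf.createOrReplaceTempView("ratingjoin")
GpredictionsDf.createOrReplaceTempView("gpredjoin")
RpredictionsDf.createOrReplaceTempView("rpredjoin")
LpredictionsDf.createOrReplaceTempView("lpredjoin")
val viaSql = spark.sql(
  """SELECT gprediction, rprediction, lprediction, ratings
    |FROM gpredjoin
    |JOIN rpredjoin ON gpredjoin.id = rpredjoin.id
    |JOIN lpredjoin ON rpredjoin.id = lpredjoin.id
    |JOIN ratingjoin ON lpredjoin.id = ratingjoin.id""".stripMargin)

// Every id from 1 to 10 appears exactly once in each frame, so both variants return 10 rows.
println(s"using-column join: ${viaUsing.count()} rows, SQL join: ${viaSql.count()} rows")
```

If both variants return rows on hand-built data but the real frames still produce nothing, that points at the contents of the real "id" columns rather than at the join syntax.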
answered Nov 26 '18 at 7:29 by Sathiyan S