How to merge a list of data.tables without getting splitted columns?
I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.
# base R
system.time(
M1 <- Reduce(function(...) merge(..., all=TRUE), L)
)
# user system elapsed
# 5.05 0.00 5.20
# data.table
library(data.table)
L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
system.time(
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
)
# user system elapsed
# 0.12 0.00 0.12
Both approaches yield the same values, however there are some columns that are split with data.table.
base R:
set.seed(1)
car::some(M1, 5)
# sid id V3 V4 a b
# 60504 1 60504 -0.6964804 -1.210195 NA NA
# 79653 1 79653 -2.5287163 -1.087546 NA NA
# 111637 2 11637 0.7104236 NA -1.7377657 NA
# 171855 2 71855 0.2023342 NA -0.6334279 NA
# 272460 3 72460 -0.5098994 NA NA 0.2738896
data.table:
set.seed(1)
car::some(M2, 5)
# sid id V3.x V4 V3.y a V3 b
# 1: 1 60504 -0.6964804 -1.210195 NA NA NA NA
# 2: 1 79653 -2.5287163 -1.087546 NA NA NA NA
# 3: 2 11637 NA NA 0.7104236 -1.7377657 NA NA
# 4: 2 71855 NA NA 0.2023342 -0.6334279 NA NA
# 5: 3 72460 NA NA NA NA -0.5098994 0.2738896
Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)
Data
fun <- function(x){
set.seed(x)
data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
}
tmp <- lapply(1:3, fun)
df1 <- tmp[[1]]
df2 <- tmp[[2]]
df3 <- tmp[[3]]
rm(tmp)
names(df2)[4] <- c("a")
names(df3)[4] <- c("b")
L <- list(df1, df2, df3)
Related: 1, 2
r merge data.table
add a comment |
I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.
# base R
system.time(
M1 <- Reduce(function(...) merge(..., all=TRUE), L)
)
# user system elapsed
# 5.05 0.00 5.20
# data.table
library(data.table)
L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
system.time(
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
)
# user system elapsed
# 0.12 0.00 0.12
Both approaches yield the same values, however there are some columns that are split with data.table.
base R:
set.seed(1)
car::some(M1, 5)
# sid id V3 V4 a b
# 60504 1 60504 -0.6964804 -1.210195 NA NA
# 79653 1 79653 -2.5287163 -1.087546 NA NA
# 111637 2 11637 0.7104236 NA -1.7377657 NA
# 171855 2 71855 0.2023342 NA -0.6334279 NA
# 272460 3 72460 -0.5098994 NA NA 0.2738896
data.table:
set.seed(1)
car::some(M2, 5)
# sid id V3.x V4 V3.y a V3 b
# 1: 1 60504 -0.6964804 -1.210195 NA NA NA NA
# 2: 1 79653 -2.5287163 -1.087546 NA NA NA NA
# 3: 2 11637 NA NA 0.7104236 -1.7377657 NA NA
# 4: 2 71855 NA NA 0.2023342 -0.6334279 NA NA
# 5: 3 72460 NA NA NA NA -0.5098994 0.2738896
Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)
Data
fun <- function(x){
set.seed(x)
data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
}
tmp <- lapply(1:3, fun)
df1 <- tmp[[1]]
df2 <- tmp[[2]]
df3 <- tmp[[3]]
rm(tmp)
names(df2)[4] <- c("a")
names(df3)[4] <- c("b")
L <- list(df1, df2, df3)
Related: 1, 2
r merge data.table
add a comment |
I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.
# base R
system.time(
M1 <- Reduce(function(...) merge(..., all=TRUE), L)
)
# user system elapsed
# 5.05 0.00 5.20
# data.table
library(data.table)
L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
system.time(
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
)
# user system elapsed
# 0.12 0.00 0.12
Both approaches yield the same values, however there are some columns that are split with data.table.
base R:
set.seed(1)
car::some(M1, 5)
# sid id V3 V4 a b
# 60504 1 60504 -0.6964804 -1.210195 NA NA
# 79653 1 79653 -2.5287163 -1.087546 NA NA
# 111637 2 11637 0.7104236 NA -1.7377657 NA
# 171855 2 71855 0.2023342 NA -0.6334279 NA
# 272460 3 72460 -0.5098994 NA NA 0.2738896
data.table:
set.seed(1)
car::some(M2, 5)
# sid id V3.x V4 V3.y a V3 b
# 1: 1 60504 -0.6964804 -1.210195 NA NA NA NA
# 2: 1 79653 -2.5287163 -1.087546 NA NA NA NA
# 3: 2 11637 NA NA 0.7104236 -1.7377657 NA NA
# 4: 2 71855 NA NA 0.2023342 -0.6334279 NA NA
# 5: 3 72460 NA NA NA NA -0.5098994 0.2738896
Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)
Data
fun <- function(x){
set.seed(x)
data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
}
tmp <- lapply(1:3, fun)
df1 <- tmp[[1]]
df2 <- tmp[[2]]
df3 <- tmp[[3]]
rm(tmp)
names(df2)[4] <- c("a")
names(df3)[4] <- c("b")
L <- list(df1, df2, df3)
Related: 1, 2
r merge data.table
I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.
# base R
system.time(
M1 <- Reduce(function(...) merge(..., all=TRUE), L)
)
# user system elapsed
# 5.05 0.00 5.20
# data.table
library(data.table)
L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))
system.time(
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
)
# user system elapsed
# 0.12 0.00 0.12
Both approaches yield the same values, however there are some columns that are split with data.table.
base R:
set.seed(1)
car::some(M1, 5)
# sid id V3 V4 a b
# 60504 1 60504 -0.6964804 -1.210195 NA NA
# 79653 1 79653 -2.5287163 -1.087546 NA NA
# 111637 2 11637 0.7104236 NA -1.7377657 NA
# 171855 2 71855 0.2023342 NA -0.6334279 NA
# 272460 3 72460 -0.5098994 NA NA 0.2738896
data.table:
set.seed(1)
car::some(M2, 5)
# sid id V3.x V4 V3.y a V3 b
# 1: 1 60504 -0.6964804 -1.210195 NA NA NA NA
# 2: 1 79653 -2.5287163 -1.087546 NA NA NA NA
# 3: 2 11637 NA NA 0.7104236 -1.7377657 NA NA
# 4: 2 71855 NA NA 0.2023342 -0.6334279 NA NA
# 5: 3 72460 NA NA NA NA -0.5098994 0.2738896
Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)
Data
fun <- function(x){
set.seed(x)
data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))
}
tmp <- lapply(1:3, fun)
df1 <- tmp[[1]]
df2 <- tmp[[2]]
df3 <- tmp[[3]]
rm(tmp)
names(df2)[4] <- c("a")
names(df3)[4] <- c("b")
L <- list(df1, df2, df3)
Related: 1, 2
r merge data.table
r merge data.table
edited Nov 23 '18 at 0:18
jay.sf
asked Nov 23 '18 at 0:07
jay.sfjay.sf
4,63321539
4,63321539
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
The by
argument in base::merge
defaults to intersect(names(x), names(y))
where x
and y
are the 2 tables to be merged. Hence, base::merge
also uses V3
as the merging key.
The by
argument in data.table::merge
defaults to the shared key columns between the two tables (i.e. sid
and id
in this case). And since the tables have columns named V3
, suffixes are appended to the new columns.
So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:
commcols <- Reduce(intersect, lapply(L, names))
L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
Ok, in my real data, as you can imagine, there are hundreds ofV3
s, which is actually the point.
– jay.sf
Nov 23 '18 at 0:22
you can specify yourby
columns if you really want to merge by V3 as well. orL.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3")))
if this is really what you intended and what you did withbase::merge
– chinsoon12
Nov 23 '18 at 0:25
And this then continues with thea
s andb
s andc
s and ... from thedt
s to merge.
– jay.sf
Nov 23 '18 at 0:26
just identify the common columns usingby=Reduce(intersect, lapply(L.dt, names))
– chinsoon12
Nov 23 '18 at 0:27
I have a follow-up issue up with the real data, though. Error messagex has some duplicated column name(s)
citing 118 names, data has 431 names. Thought it was related tointersect
and triedunion
but didn't help. Could you clarify that here, or shall I rather ask another question?
– jay.sf
Nov 23 '18 at 0:56
|
show 2 more comments
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53439279%2fhow-to-merge-a-list-of-data-tables-without-getting-splitted-columns%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
The by
argument in base::merge
defaults to intersect(names(x), names(y))
where x
and y
are the 2 tables to be merged. Hence, base::merge
also uses V3
as the merging key.
The by
argument in data.table::merge
defaults to the shared key columns between the two tables (i.e. sid
and id
in this case). And since the tables have columns named V3
, suffixes are appended to the new columns.
So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:
commcols <- Reduce(intersect, lapply(L, names))
L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
Ok, in my real data, as you can imagine, there are hundreds ofV3
s, which is actually the point.
– jay.sf
Nov 23 '18 at 0:22
you can specify yourby
columns if you really want to merge by V3 as well. orL.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3")))
if this is really what you intended and what you did withbase::merge
– chinsoon12
Nov 23 '18 at 0:25
And this then continues with thea
s andb
s andc
s and ... from thedt
s to merge.
– jay.sf
Nov 23 '18 at 0:26
just identify the common columns usingby=Reduce(intersect, lapply(L.dt, names))
– chinsoon12
Nov 23 '18 at 0:27
I have a follow-up issue up with the real data, though. Error messagex has some duplicated column name(s)
citing 118 names, data has 431 names. Thought it was related tointersect
and triedunion
but didn't help. Could you clarify that here, or shall I rather ask another question?
– jay.sf
Nov 23 '18 at 0:56
|
show 2 more comments
The by
argument in base::merge
defaults to intersect(names(x), names(y))
where x
and y
are the 2 tables to be merged. Hence, base::merge
also uses V3
as the merging key.
The by
argument in data.table::merge
defaults to the shared key columns between the two tables (i.e. sid
and id
in this case). And since the tables have columns named V3
, suffixes are appended to the new columns.
So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:
commcols <- Reduce(intersect, lapply(L, names))
L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
Ok, in my real data, as you can imagine, there are hundreds ofV3
s, which is actually the point.
– jay.sf
Nov 23 '18 at 0:22
you can specify yourby
columns if you really want to merge by V3 as well. orL.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3")))
if this is really what you intended and what you did withbase::merge
– chinsoon12
Nov 23 '18 at 0:25
And this then continues with thea
s andb
s andc
s and ... from thedt
s to merge.
– jay.sf
Nov 23 '18 at 0:26
just identify the common columns usingby=Reduce(intersect, lapply(L.dt, names))
– chinsoon12
Nov 23 '18 at 0:27
I have a follow-up issue up with the real data, though. Error messagex has some duplicated column name(s)
citing 118 names, data has 431 names. Thought it was related tointersect
and triedunion
but didn't help. Could you clarify that here, or shall I rather ask another question?
– jay.sf
Nov 23 '18 at 0:56
|
show 2 more comments
The by
argument in base::merge
defaults to intersect(names(x), names(y))
where x
and y
are the 2 tables to be merged. Hence, base::merge
also uses V3
as the merging key.
The by
argument in data.table::merge
defaults to the shared key columns between the two tables (i.e. sid
and id
in this case). And since the tables have columns named V3
, suffixes are appended to the new columns.
So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:
commcols <- Reduce(intersect, lapply(L, names))
L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
The by
argument in base::merge
defaults to intersect(names(x), names(y))
where x
and y
are the 2 tables to be merged. Hence, base::merge
also uses V3
as the merging key.
The by
argument in data.table::merge
defaults to the shared key columns between the two tables (i.e. sid
and id
in this case). And since the tables have columns named V3
, suffixes are appended to the new columns.
So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:
commcols <- Reduce(intersect, lapply(L, names))
L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))
M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)
edited Nov 23 '18 at 0:36
answered Nov 23 '18 at 0:13
chinsoon12chinsoon12
8,71111219
8,71111219
Ok, in my real data, as you can imagine, there are hundreds ofV3
s, which is actually the point.
– jay.sf
Nov 23 '18 at 0:22
you can specify yourby
columns if you really want to merge by V3 as well. orL.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3")))
if this is really what you intended and what you did withbase::merge
– chinsoon12
Nov 23 '18 at 0:25
And this then continues with thea
s andb
s andc
s and ... from thedt
s to merge.
– jay.sf
Nov 23 '18 at 0:26
just identify the common columns usingby=Reduce(intersect, lapply(L.dt, names))
– chinsoon12
Nov 23 '18 at 0:27
I have a follow-up issue up with the real data, though. Error messagex has some duplicated column name(s)
citing 118 names, data has 431 names. Thought it was related tointersect
and triedunion
but didn't help. Could you clarify that here, or shall I rather ask another question?
– jay.sf
Nov 23 '18 at 0:56
|
show 2 more comments
Ok, in my real data, as you can imagine, there are hundreds ofV3
s, which is actually the point.
– jay.sf
Nov 23 '18 at 0:22
you can specify yourby
columns if you really want to merge by V3 as well. orL.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3")))
if this is really what you intended and what you did withbase::merge
– chinsoon12
Nov 23 '18 at 0:25
And this then continues with thea
s andb
s andc
s and ... from thedt
s to merge.
– jay.sf
Nov 23 '18 at 0:26
just identify the common columns usingby=Reduce(intersect, lapply(L.dt, names))
– chinsoon12
Nov 23 '18 at 0:27
I have a follow-up issue up with the real data, though. Error messagex has some duplicated column name(s)
citing 118 names, data has 431 names. Thought it was related tointersect
and triedunion
but didn't help. Could you clarify that here, or shall I rather ask another question?
– jay.sf
Nov 23 '18 at 0:56
Ok, in my real data, as you can imagine, there are hundreds of
V3
s, which is actually the point.– jay.sf
Nov 23 '18 at 0:22
Ok, in my real data, as you can imagine, there are hundreds of
V3
s, which is actually the point.– jay.sf
Nov 23 '18 at 0:22
you can specify your
by
columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3")))
if this is really what you intended and what you did with base::merge
– chinsoon12
Nov 23 '18 at 0:25
you can specify your
by
columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3")))
if this is really what you intended and what you did with base::merge
– chinsoon12
Nov 23 '18 at 0:25
And this then continues with the
a
s and b
s and c
s and ... from the dt
s to merge.– jay.sf
Nov 23 '18 at 0:26
And this then continues with the
a
s and b
s and c
s and ... from the dt
s to merge.– jay.sf
Nov 23 '18 at 0:26
just identify the common columns using
by=Reduce(intersect, lapply(L.dt, names))
– chinsoon12
Nov 23 '18 at 0:27
just identify the common columns using
by=Reduce(intersect, lapply(L.dt, names))
– chinsoon12
Nov 23 '18 at 0:27
I have a follow-up issue up with the real data, though. Error message
x has some duplicated column name(s)
citing 118 names, data has 431 names. Thought it was related to intersect
and tried union
but didn't help. Could you clarify that here, or shall I rather ask another question?– jay.sf
Nov 23 '18 at 0:56
I have a follow-up issue up with the real data, though. Error message
x has some duplicated column name(s)
citing 118 names, data has 431 names. Thought it was related to intersect
and tried union
but didn't help. Could you clarify that here, or shall I rather ask another question?– jay.sf
Nov 23 '18 at 0:56
|
show 2 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53439279%2fhow-to-merge-a-list-of-data-tables-without-getting-splitted-columns%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown