How to merge a list of data.tables without getting splitted columns?

I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.

# base R

system.time(

  M1 <- Reduce(function(...) merge(..., all=TRUE), L)

  )

# user  system elapsed 

# 5.05    0.00    5.20 



# data.table    

library(data.table)

L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))

system.time(

  M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

  )

# user  system elapsed 

# 0.12    0.00    0.12

Both approaches yield the same values, however there are some columns that are split with data.table.

base R:

set.seed(1)

car::some(M1, 5)

#        sid    id         V3        V4          a         b

# 60504    1 60504 -0.6964804 -1.210195         NA        NA

# 79653    1 79653 -2.5287163 -1.087546         NA        NA

# 111637   2 11637  0.7104236        NA -1.7377657        NA

# 171855   2 71855  0.2023342        NA -0.6334279        NA

# 272460   3 72460 -0.5098994        NA         NA 0.2738896

data.table:

set.seed(1)

car::some(M2, 5)

#    sid    id       V3.x        V4      V3.y          a         V3         b

# 1:   1 60504 -0.6964804 -1.210195        NA         NA         NA        NA

# 2:   1 79653 -2.5287163 -1.087546        NA         NA         NA        NA

# 3:   2 11637         NA        NA 0.7104236 -1.7377657         NA        NA

# 4:   2 71855         NA        NA 0.2023342 -0.6334279         NA        NA

# 5:   3 72460         NA        NA        NA         NA -0.5098994 0.2738896

Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)

Data

fun <- function(x){

  set.seed(x)

  data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))

}

tmp <- lapply(1:3, fun)

df1 <- tmp[[1]]

df2 <- tmp[[2]]

df3 <- tmp[[3]]

rm(tmp)

names(df2)[4] <- c("a")

names(df3)[4] <- c("b")

L <- list(df1, df2, df3)

Related: 1, 2

edited Nov 23 '18 at 0:18

asked Nov 23 '18 at 0:07

jay.sf

4,63321539

add a comment |

I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.

# base R

system.time(

  M1 <- Reduce(function(...) merge(..., all=TRUE), L)

  )

# user  system elapsed 

# 5.05    0.00    5.20 



# data.table    

library(data.table)

L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))

system.time(

  M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

  )

# user  system elapsed 

# 0.12    0.00    0.12

Both approaches yield the same values, however there are some columns that are split with data.table.

base R:

set.seed(1)

car::some(M1, 5)

#        sid    id         V3        V4          a         b

# 60504    1 60504 -0.6964804 -1.210195         NA        NA

# 79653    1 79653 -2.5287163 -1.087546         NA        NA

# 111637   2 11637  0.7104236        NA -1.7377657        NA

# 171855   2 71855  0.2023342        NA -0.6334279        NA

# 272460   3 72460 -0.5098994        NA         NA 0.2738896

data.table:

set.seed(1)

car::some(M2, 5)

#    sid    id       V3.x        V4      V3.y          a         V3         b

# 1:   1 60504 -0.6964804 -1.210195        NA         NA         NA        NA

# 2:   1 79653 -2.5287163 -1.087546        NA         NA         NA        NA

# 3:   2 11637         NA        NA 0.7104236 -1.7377657         NA        NA

# 4:   2 71855         NA        NA 0.2023342 -0.6334279         NA        NA

# 5:   3 72460         NA        NA        NA         NA -0.5098994 0.2738896

Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)

Data

fun <- function(x){

  set.seed(x)

  data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))

}

tmp <- lapply(1:3, fun)

df1 <- tmp[[1]]

df2 <- tmp[[2]]

df3 <- tmp[[3]]

rm(tmp)

names(df2)[4] <- c("a")

names(df3)[4] <- c("b")

L <- list(df1, df2, df3)

Related: 1, 2

edited Nov 23 '18 at 0:18

asked Nov 23 '18 at 0:07

jay.sf

4,63321539

add a comment |

I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.

# base R

system.time(

  M1 <- Reduce(function(...) merge(..., all=TRUE), L)

  )

# user  system elapsed 

# 5.05    0.00    5.20 



# data.table    

library(data.table)

L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))

system.time(

  M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

  )

# user  system elapsed 

# 0.12    0.00    0.12

Both approaches yield the same values, however there are some columns that are split with data.table.

base R:

set.seed(1)

car::some(M1, 5)

#        sid    id         V3        V4          a         b

# 60504    1 60504 -0.6964804 -1.210195         NA        NA

# 79653    1 79653 -2.5287163 -1.087546         NA        NA

# 111637   2 11637  0.7104236        NA -1.7377657        NA

# 171855   2 71855  0.2023342        NA -0.6334279        NA

# 272460   3 72460 -0.5098994        NA         NA 0.2738896

data.table:

set.seed(1)

car::some(M2, 5)

#    sid    id       V3.x        V4      V3.y          a         V3         b

# 1:   1 60504 -0.6964804 -1.210195        NA         NA         NA        NA

# 2:   1 79653 -2.5287163 -1.087546        NA         NA         NA        NA

# 3:   2 11637         NA        NA 0.7104236 -1.7377657         NA        NA

# 4:   2 71855         NA        NA 0.2023342 -0.6334279         NA        NA

# 5:   3 72460         NA        NA        NA         NA -0.5098994 0.2738896

Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)

Data

fun <- function(x){

  set.seed(x)

  data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))

}

tmp <- lapply(1:3, fun)

df1 <- tmp[[1]]

df2 <- tmp[[2]]

df3 <- tmp[[3]]

rm(tmp)

names(df2)[4] <- c("a")

names(df3)[4] <- c("b")

L <- list(df1, df2, df3)

Related: 1, 2

edited Nov 23 '18 at 0:18

asked Nov 23 '18 at 0:07

jay.sf

4,63321539

I'm about to merge large data sets. That's why I try out data.table and am thrilled by its speed.

# base R

system.time(

  M1 <- Reduce(function(...) merge(..., all=TRUE), L)

  )

# user  system elapsed 

# 5.05    0.00    5.20 



# data.table    

library(data.table)

L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id")))

system.time(

  M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

  )

# user  system elapsed 

# 0.12    0.00    0.12

Both approaches yield the same values, however there are some columns that are split with data.table.

base R:

set.seed(1)

car::some(M1, 5)

#        sid    id         V3        V4          a         b

# 60504    1 60504 -0.6964804 -1.210195         NA        NA

# 79653    1 79653 -2.5287163 -1.087546         NA        NA

# 111637   2 11637  0.7104236        NA -1.7377657        NA

# 171855   2 71855  0.2023342        NA -0.6334279        NA

# 272460   3 72460 -0.5098994        NA         NA 0.2738896

data.table:

set.seed(1)

car::some(M2, 5)

#    sid    id       V3.x        V4      V3.y          a         V3         b

# 1:   1 60504 -0.6964804 -1.210195        NA         NA         NA        NA

# 2:   1 79653 -2.5287163 -1.087546        NA         NA         NA        NA

# 3:   2 11637         NA        NA 0.7104236 -1.7377657         NA        NA

# 4:   2 71855         NA        NA 0.2023342 -0.6334279         NA        NA

# 5:   3 72460         NA        NA        NA         NA -0.5098994 0.2738896

Did I miss something? Or is there an easy way to solve this, i.e. get the split columns combined? (I don't want to use any other packages.)

Data

fun <- function(x){

  set.seed(x)

  data.frame(cbind(sid=x, id=1:1e5, matrix(rnorm(1e5*2), 1e5)))

}

tmp <- lapply(1:3, fun)

df1 <- tmp[[1]]

df2 <- tmp[[2]]

df3 <- tmp[[3]]

rm(tmp)

names(df2)[4] <- c("a")

names(df3)[4] <- c("b")

L <- list(df1, df2, df3)

Related: 1, 2

r merge data.table

edited Nov 23 '18 at 0:18

asked Nov 23 '18 at 0:07

jay.sf

4,63321539

edited Nov 23 '18 at 0:18

asked Nov 23 '18 at 0:07

jay.sf

4,63321539

edited Nov 23 '18 at 0:18

asked Nov 23 '18 at 0:07

jay.sf

4,63321539

asked Nov 23 '18 at 0:07

jay.sf

4,63321539

asked Nov 23 '18 at 0:07

jay.sf

4,63321539

add a comment |

1 Answer
1

active

oldest

votes

The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.

The by argument in data.table::merge defaults to the shared key columns between the two tables (i.e. sid and id in this case). And since the tables have columns named V3, suffixes are appended to the new columns.

So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:

commcols <- Reduce(intersect, lapply(L, names))

L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))

M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

edited Nov 23 '18 at 0:36

answered Nov 23 '18 at 0:13

chinsoon12

8,71111219

Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.

– jay.sf
Nov 23 '18 at 0:22

you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge

– chinsoon12
Nov 23 '18 at 0:25

And this then continues with the as and bs and cs and ... from the dts to merge.

– jay.sf
Nov 23 '18 at 0:26

just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))

– chinsoon12
Nov 23 '18 at 0:27

I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?

– jay.sf
Nov 23 '18 at 0:56

|
show 2 more comments

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53439279%2fhow-to-merge-a-list-of-data-tables-without-getting-splitted-columns%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.

So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:

commcols <- Reduce(intersect, lapply(L, names))

L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))

M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

edited Nov 23 '18 at 0:36

answered Nov 23 '18 at 0:13

chinsoon12

8,71111219

Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.

– jay.sf
Nov 23 '18 at 0:22

you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge

– chinsoon12
Nov 23 '18 at 0:25

And this then continues with the as and bs and cs and ... from the dts to merge.

– jay.sf
Nov 23 '18 at 0:26

just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))

– chinsoon12
Nov 23 '18 at 0:27

I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?

– jay.sf
Nov 23 '18 at 0:56

|
show 2 more comments

The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.

So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:

commcols <- Reduce(intersect, lapply(L, names))

L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))

M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

edited Nov 23 '18 at 0:36

answered Nov 23 '18 at 0:13

chinsoon12

8,71111219

Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.

– jay.sf
Nov 23 '18 at 0:22

you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge

– chinsoon12
Nov 23 '18 at 0:25

And this then continues with the as and bs and cs and ... from the dts to merge.

– jay.sf
Nov 23 '18 at 0:26

just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))

– chinsoon12
Nov 23 '18 at 0:27

I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?

– jay.sf
Nov 23 '18 at 0:56

|
show 2 more comments

The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.

So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:

commcols <- Reduce(intersect, lapply(L, names))

L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))

M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

edited Nov 23 '18 at 0:36

answered Nov 23 '18 at 0:13

chinsoon12

8,71111219

The by argument in base::merge defaults to intersect(names(x), names(y)) where x and y are the 2 tables to be merged. Hence, base::merge also uses V3 as the merging key.

So if your intention is to merge by all common columns, you can identify the common columns, set keys then merge:

commcols <- Reduce(intersect, lapply(L, names))

L.dt <- lapply(L, function(x) setkeyv(data.table(x), commcols))

M2 <- Reduce(function(...) merge(..., all=TRUE), L.dt)

edited Nov 23 '18 at 0:36

answered Nov 23 '18 at 0:13

chinsoon12

8,71111219

edited Nov 23 '18 at 0:36

answered Nov 23 '18 at 0:13

chinsoon12

8,71111219

answered Nov 23 '18 at 0:13

chinsoon12

8,71111219

answered Nov 23 '18 at 0:13

chinsoon12

8,71111219

Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.

– jay.sf
Nov 23 '18 at 0:22

you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge

– chinsoon12
Nov 23 '18 at 0:25

And this then continues with the as and bs and cs and ... from the dts to merge.

– jay.sf
Nov 23 '18 at 0:26

just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))

– chinsoon12
Nov 23 '18 at 0:27

I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?

– jay.sf
Nov 23 '18 at 0:56

|
show 2 more comments

Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.

– jay.sf
Nov 23 '18 at 0:22

you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge

– chinsoon12
Nov 23 '18 at 0:25

And this then continues with the as and bs and cs and ... from the dts to merge.

– jay.sf
Nov 23 '18 at 0:26

just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))

– chinsoon12
Nov 23 '18 at 0:27

I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?

– jay.sf
Nov 23 '18 at 0:56

Ok, in my real data, as you can imagine, there are hundreds of V3s, which is actually the point.

– jay.sf
Nov 23 '18 at 0:22

you can specify your by columns if you really want to merge by V3 as well. or L.dt <- lapply(L, function(x) setkeyv(data.table(x), c("sid", "id", "V3"))) if this is really what you intended and what you did with base::merge

– chinsoon12
Nov 23 '18 at 0:25

And this then continues with the as and bs and cs and ... from the dts to merge.

– jay.sf
Nov 23 '18 at 0:26

just identify the common columns using by=Reduce(intersect, lapply(L.dt, names))

– chinsoon12
Nov 23 '18 at 0:27

I have a follow-up issue up with the real data, though. Error message x has some duplicated column name(s) citing 118 names, data has 431 names. Thought it was related to intersect and tried union but didn't help. Could you clarify that here, or shall I rather ask another question?

– jay.sf
Nov 23 '18 at 0:56

|
show 2 more comments

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Tukukkk