How to split a column of tuples in a pandas DataFrame?
I have a pandas DataFrame (this is only a small piece of it):
>>> d1
y norm test y norm train len(y_train) len(y_test)
0 64.904368 116.151232 1645 549
1 70.852681 112.639876 1645 549
SVR RBF
0 (35.652207342877873, 22.95533537448393)
1 (39.563683797747622, 27.382483096332511)
LCV
0 (19.365430594452338, 13.880062435173587)
1 (19.099614489458364, 14.018867136617146)
RIDGE CV
0 (4.2907610988480362, 12.416745648065584)
1 (4.18864306788194, 12.980833914392477)
RF
0 (9.9484841581029428, 16.46902345373697)
1 (10.139848213735391, 16.282141345406522)
GB
0 (0.012816232716538605, 15.950164822266007)
1 (0.012814519804493328, 15.305745202851712)
ET DATA
0 (0.00034337162272515505, 16.284800366214057) j2m
1 (0.00024811554516431878, 15.556506191784194) j2m
>>>
I want to split all the columns that contain tuples. For example, I want to replace the column LCV with the columns LCV-a and LCV-b.
How can I do that?
Tags: python, numpy, pandas, dataframe, tuples
asked Apr 9 '15 at 22:50 by Donbeo, edited Jun 8 at 14:08 by MERose
3 Answers
You can do this by using apply(pd.Series) on that column:
In [13]: df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})
In [14]: df
Out[14]:
a b
0 1 (1, 2)
1 2 (3, 4)
In [16]: df['b'].apply(pd.Series)
Out[16]:
0 1
0 1 2
1 3 4
In [17]: df[['b1', 'b2']] = df['b'].apply(pd.Series)
In [18]: df
Out[18]:
a b b1 b2
0 1 (1, 2) 1 2
1 2 (3, 4) 3 4
This works because it turns each tuple into a Series, which is then seen as a row of the DataFrame.
answered Apr 9 '15 at 22:55 by joris, edited Apr 9 '15 at 22:57
is there a way to automate it due to the large number of columns?
– Donbeo
Apr 9 '15 at 22:56
Not directly I think. But you can easily write a function for it using the above code (+ removing the original one)
– joris
Apr 9 '15 at 22:58
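A minimal sketch of such a helper, under the assumption that every value in a tuple column is a tuple of the same length and that alphabetical suffixes ('-a', '-b', ...) as in the question are acceptable; split_tuple_columns is a made-up name, not an existing pandas function:

import string
import pandas as pd

def split_tuple_columns(df):
    """Replace every tuple-valued column with one column per tuple element."""
    out = df.copy()
    for col in df.columns:
        # Only touch columns whose values are all tuples
        if df[col].map(lambda v: isinstance(v, tuple)).all():
            expanded = df[col].apply(pd.Series)
            # Name the new columns '<col>-a', '<col>-b', ... (assumes <= 26 elements)
            expanded.columns = ['{}-{}'.format(col, string.ascii_lowercase[i])
                                for i in range(expanded.shape[1])]
            out = out.drop(col, axis=1).join(expanded)
    return out

# d2 = split_tuple_columns(d1)  # e.g. 'LCV' becomes 'LCV-a' and 'LCV-b'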
Wait, it is not working for me. I updated the question.
– Donbeo
Apr 9 '15 at 22:59
If you have a large number of columns, you may want to consider 'tidying' your data: vita.had.co.nz/papers/tidy-data.html. You can do this using the melt function.
– Axel
Feb 15 at 18:49
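For reference, a rough sketch of that melt-based 'tidy' approach; the wide frame below is a made-up stand-in for d1, not the original data:

import pandas as pd

wide = pd.DataFrame({
    'run': [0, 1],
    'LCV': [(19.4, 13.9), (19.1, 14.0)],
    'RF':  [(9.9, 16.5), (10.1, 16.3)],
})

# One row per (run, model); the tuple is then split into two score columns
long = pd.melt(wide, id_vars='run', var_name='model', value_name='scores')
long[['score_a', 'score_b']] = long['scores'].apply(pd.Series)
long = long.drop('scores', axis=1)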
On much larger datasets, I found that .apply() is a few orders of magnitude slower than pd.DataFrame(df['b'].values.tolist(), index=df.index).
This performance issue was closed on GitHub, although I do not agree with that decision: https://github.com/pandas-dev/pandas/issues/11615
EDIT: based on this answer: https://stackoverflow.com/a/44196843/2230844
answered Nov 17 '15 at 17:58 by denfromufa, edited Nov 20 at 20:15
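A rough way to check the gap yourself (a sketch, not the benchmark from the linked issue; absolute timings depend on your machine and pandas version):

import timeit
import pandas as pd

df = pd.DataFrame({'b': [(i, i + 1) for i in range(100000)]})

t_apply = timeit.timeit(lambda: df['b'].apply(pd.Series), number=1)
t_tolist = timeit.timeit(
    lambda: pd.DataFrame(df['b'].values.tolist(), index=df.index), number=1)

print('apply(pd.Series):        %.3f s' % t_apply)
print('values.tolist() + index: %.3f s' % t_tolist)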
pd.DataFrame(df['b'].tolist()) without the .values seems to work just fine too. (And thanks, your solution is much faster than .apply().)
– Swier
Sep 19 '16 at 7:41
I was worried about capturing the index, hence the explicit usage of .values.
– denfromufa
Sep 20 '16 at 3:17
never use apply if performance is an issue!
– Mike Palmice
Jun 5 at 13:37
I know this is from a while ago, but a caveat of the second solution:
pd.DataFrame(df['b'].values.tolist())
is that it will explicitly discard the index and add in a default sequential index, whereas the accepted answer
apply(pd.Series)
will not, since the result of apply retains the row index. While the order is initially retained from the original array, pandas will try to match the indices of the two dataframes.
This can be very important if you are trying to set the rows into a numerically indexed array: pandas will automatically try to match the index of the new array to the old one, which can distort the ordering.
A better hybrid solution is to set the index of the original dataframe onto the new one, i.e.
pd.DataFrame(df['b'].values.tolist(), index=df.index)
which retains the speed of the second method while ensuring the order and indexing are preserved in the result.
answered May 26 '17 at 8:20 by Mike, edited Jun 12 at 23:22 by jpp
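A small sketch of the pitfall and the fix, using a toy frame with a non-default index (made up for illustration; exact alignment behaviour may vary slightly across pandas versions):

import pandas as pd

df = pd.DataFrame({'b': [(1, 2), (3, 4)]}, index=[10, 20])

# Without index=, the new frame gets a fresh 0..n-1 index; assigning its
# columns back to df aligns on the index, so b1/b2 end up as NaN here.
df[['b1', 'b2']] = pd.DataFrame(df['b'].values.tolist())
print(df)          # b1 and b2 are all NaN
df = df.drop(['b1', 'b2'], axis=1)

# Passing index=df.index keeps the rows aligned with the original frame.
df[['b1', 'b2']] = pd.DataFrame(df['b'].values.tolist(), index=df.index)
print(df)          # b1 and b2 now hold the tuple elements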
I edited my answer based on your indexing observation, thanks!
– denfromufa
Nov 20 at 20:16