Filter DataFrame to Duplicated Items and Compute Groupwise Means on Result
Ok, so here is what I'm trying to do:
I have a DataFrame like this:
import pandas as pd

data = pd.DataFrame(
    {'a': [1, 1, 1, 2, 2, 3, 3, 3],
     'b': [23, 45, 62, 24, 45, 34, 25, 62]})
I managed to calculate the mean of column 'a' grouped by column 'b' using the following code:
data.groupby('b', as_index=False)['a'].mean()
which resulted in a DataFrame like this:

    b    a
0  23  1.0
1  24  2.0
2  25  3.0
3  34  3.0
4  45  1.5
5  62  2.0

However, I'd like to only calculate the mean for the values of 'b' that occur more than once in the DataFrame, resulting in a DataFrame like this:

    b    a
0  45  1.5
1  62  2.0
I tried to do it by using the following line:
data.groupby('b', as_index=False).filter(lambda group: len(group)>1)['a'].mean()
but it results in the overall mean of rows 1, 2, 4 and 7, which is obviously not what I want.
Can someone please help me to obtain the desired DataFrame and tell me what I'm getting wrong on the use of the filter function?
Thank you!
python pandas dataframe pandas-groupby
asked Nov 20 at 17:56 by Vinícius Silva (edited Nov 20 at 18:07 by Brad Solomon)
3 Answers
accepted
Grouping on Duplicates
You can do this with data['b'].duplicated(keep=False) to create a boolean mask first:
>>> data[data['b'].duplicated(keep=False)].groupby('b', as_index=False)['a'].mean()
b a
0 45 1.5
1 62 2.0
data.b.duplicated(keep=False) marks all duplicated occurrences as True and lets you restrict output to those rows:
>>> data.b.duplicated(keep=False)
0 False
1 True
2 True
3 False
4 True
5 False
6 False
7 True
Name: b, dtype: bool
>>> data[data.b.duplicated(keep=False)]
a b
1 1 45
2 1 62
4 2 45
7 3 62
Update: Grouping by Arbitrary Number of Occurrences
Can this solution be generalized to look for an arbitrary number of occurrences? Let's say I wanted to calculate the mean only for values that occurred more than 5 times in the DataFrame.
In this scenario, you need to generate a boolean mask of the same shape as in the example above, but using a slightly different approach.
Here is one way:
>>> vc = data['b'].map(data['b'].value_counts(sort=False))
>>> vc
0 1
1 2
2 2
3 1
4 2
5 1
6 1
7 2
Name: b, dtype: int64
These are the element-wise counts for each element of b. To turn this into a mask (say you want means only where count == 2, which matches the example above but extends to any integer):
mask = vc == 2 # or > 5, in your case
data[mask].groupby('b', as_index=False)['a'].mean()
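An alternative way to build the same element-wise counts, as a sketch, is groupby(...).transform('size'), which broadcasts each group's size back to its rows and avoids the separate value_counts map:

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 3, 3],
                     'b': [23, 45, 62, 24, 45, 34, 25, 62]})

# transform('size') returns a Series aligned with data's index,
# holding the size of each row's 'b' group.
sizes = data.groupby('b')['b'].transform('size')

# The comparison replaces `vc == 2` / `vc > 5` from above.
result = data[sizes > 1].groupby('b', as_index=False)['a'].mean()
print(result)
```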
answered Nov 20 at 18:03 by Brad Solomon (edited Nov 20 at 18:20)

This is a very nice way to do it! – jpp, Nov 20 at 18:03
You can filter your DataFrame via loc before the groupby:
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 3, 3],
                   'b': [23, 45, 62, 24, 45, 34, 25, 62]})

counts = df['b'].value_counts()
res = (df.loc[df['b'].isin(counts[counts > 1].index)]
         .groupby('b', as_index=False)['a'].mean())
print(res)
b a
0 45 1.5
1 62 2.0
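As a sketch of how this approach generalizes, the count threshold can be made a parameter (the helper name mean_for_frequent is hypothetical, not from the answer):

```python
import pandas as pd

def mean_for_frequent(df, key, val, min_count=2):
    # Keep only key values occurring at least min_count times,
    # then compute the groupwise mean of val.
    counts = df[key].value_counts()
    frequent = counts[counts >= min_count].index
    return df.loc[df[key].isin(frequent)].groupby(key, as_index=False)[val].mean()

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 3, 3],
                   'b': [23, 45, 62, 24, 45, 34, 25, 62]})
res = mean_for_frequent(df, 'b', 'a')
print(res)
```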
answered Nov 20 at 18:01 by jpp
You were pretty close:
data.groupby('b').filter(lambda g: len(g) > 1).groupby('b', as_index=False).mean()
results in the answer you were looking for:
b a
0 45 1.5
1 62 2.0
answered Nov 20 at 18:00 by jedi (edited Nov 20 at 18:08 by Brad Solomon)

Would you care to explain why using groupby twice does the trick? – Vinícius Silva, Nov 20 at 18:15

The first groupby combined with filter keeps the records where "b" appears more than once. On that reduced subset, the second groupby on "b" computes the mean of "a" for each group. – jedi, Nov 20 at 19:17
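Splitting the chain into its two stages makes the mechanics explicit; a minimal sketch:

```python
import pandas as pd

data = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 3, 3],
                     'b': [23, 45, 62, 24, 45, 34, 25, 62]})

# Step 1: filter drops every 'b' group of size 1 and returns the
# surviving rows as a plain DataFrame (original index preserved).
kept = data.groupby('b').filter(lambda g: len(g) > 1)

# Step 2: a fresh groupby on the reduced subset computes the mean per group.
res = kept.groupby('b', as_index=False)['a'].mean()
print(res)
```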