Filter DataFrame to Duplicated Items and Compute Groupwise Means on Result

up vote
3
down vote

favorite

Ok, so here is what I'm trying to do:

I have a DataFrame like this:

data = pd.DataFrame(

{'a' : [1,1,1,2,2,3,3,3],

 'b' : [23,45,62,24,45,34,25,62],

 })

I managed to calculate the mean of column 'a' grouped by the column 'b' by using the following code:

data.groupby('b', as_index=False)['a'].mean()

which resulted in a DataFrame like this:

current DataFrame

However, I'd like to only calculate the mean for the values of 'b' that occur more than once in the DataFrame, resulting in a Dataframe like this:

Desired DataFrame

I tried to do it by using the following line:

data.groupby('b', as_index=False).filter(lambda group: len(group)>1)['a'].mean()

but it results in the mean of the lines 1, 2, 4 and 7, which is obviously not what I want.
Can someone please help me to obtain the desired DataFrame and tell me what I'm getting wrong on the use of the filter function?

Thank you!

edited Nov 20 at 18:07

Brad Solomon

13.4k73379

asked Nov 20 at 17:56

Vinícius Silva

8019

add a comment |

up vote
3
down vote

favorite

Ok, so here is what I'm trying to do:

I have a DataFrame like this:

data = pd.DataFrame(

{'a' : [1,1,1,2,2,3,3,3],

 'b' : [23,45,62,24,45,34,25,62],

 })

I managed to calculate the mean of column 'a' grouped by the column 'b' by using the following code:

data.groupby('b', as_index=False)['a'].mean()

which resulted in a DataFrame like this:

current DataFrame

However, I'd like to only calculate the mean for the values of 'b' that occur more than once in the DataFrame, resulting in a Dataframe like this:

Desired DataFrame

I tried to do it by using the following line:

data.groupby('b', as_index=False).filter(lambda group: len(group)>1)['a'].mean()

Thank you!

edited Nov 20 at 18:07

Brad Solomon

13.4k73379

asked Nov 20 at 17:56

Vinícius Silva

8019

add a comment |

up vote
3
down vote

favorite

Ok, so here is what I'm trying to do:

I have a DataFrame like this:

data = pd.DataFrame(

{'a' : [1,1,1,2,2,3,3,3],

 'b' : [23,45,62,24,45,34,25,62],

 })

I managed to calculate the mean of column 'a' grouped by the column 'b' by using the following code:

data.groupby('b', as_index=False)['a'].mean()

which resulted in a DataFrame like this:

current DataFrame

However, I'd like to only calculate the mean for the values of 'b' that occur more than once in the DataFrame, resulting in a Dataframe like this:

Desired DataFrame

I tried to do it by using the following line:

data.groupby('b', as_index=False).filter(lambda group: len(group)>1)['a'].mean()

Thank you!

edited Nov 20 at 18:07

Brad Solomon

13.4k73379

asked Nov 20 at 17:56

Vinícius Silva

8019

Ok, so here is what I'm trying to do:

I have a DataFrame like this:

data = pd.DataFrame(

{'a' : [1,1,1,2,2,3,3,3],

 'b' : [23,45,62,24,45,34,25,62],

 })

I managed to calculate the mean of column 'a' grouped by the column 'b' by using the following code:

data.groupby('b', as_index=False)['a'].mean()

which resulted in a DataFrame like this:

current DataFrame

However, I'd like to only calculate the mean for the values of 'b' that occur more than once in the DataFrame, resulting in a Dataframe like this:

Desired DataFrame

I tried to do it by using the following line:

data.groupby('b', as_index=False).filter(lambda group: len(group)>1)['a'].mean()

Thank you!

python pandas dataframe pandas-groupby

edited Nov 20 at 18:07

Brad Solomon

13.4k73379

asked Nov 20 at 17:56

Vinícius Silva

8019

edited Nov 20 at 18:07

Brad Solomon

13.4k73379

asked Nov 20 at 17:56

Vinícius Silva

8019

edited Nov 20 at 18:07

Brad Solomon

13.4k73379

edited Nov 20 at 18:07

Brad Solomon

13.4k73379

edited Nov 20 at 18:07

Brad Solomon

13.4k73379

asked Nov 20 at 17:56

Vinícius Silva

8019

asked Nov 20 at 17:56

Vinícius Silva

8019

asked Nov 20 at 17:56

Vinícius Silva

8019

add a comment |

3 Answers
3

active

oldest

votes

up vote
3
down vote

accepted

Grouping on Duplicates

You can do this with data['b'].duplicated(keep=False) to create a boolean mask first:

>>> data[data['b'].duplicated(keep=False)].groupby('b', as_index=False)['a'].mean()                                                                         

    b    a

0  45  1.5

1  62  2.0

data.b.duplicated(keep=False) marks all duplicated occurrences as True and lets you restrict output to those rows:

>>> data.b.duplicated(keep=False)                                                                                                                        

0    False

1     True

2     True

3    False

4     True

5    False

6    False

7     True

Name: b, dtype: bool



>>> data[data.b.duplicated(keep=False)]                                                                                                                  

   a   b

1  1  45

2  1  62

4  2  45

7  3  62

Update: Grouping by Arbitrary Number of Occurrences

Can this solution be generalized to look for an arbitrary number of occurrences? Let's say I wanted to calculate the mean only for values that occurred more than 5 times on the DataFrame.

In this scenario, you need to generate a boolean mask of the same shape as in the example above, but using a slightly different approach.

Here is one way:

>>> vc = data['b'].map(data['b'].value_counts(sort=False))

>>> vc                                                                                                                                                   



0    1

1    2

2    2

3    1

4    2

5    1

6    1

7    2

Name: b, dtype: int64

These are the element-wise counts for each element of b. To get this to a mask (say you want means for only count == 2, which is the same as the above in this example, but could be extended for any int):

mask = vc == 2  # or > 5, in your case

data[mask].groupby('b', as_index=False)['a'].mean()

edited Nov 20 at 18:20

answered Nov 20 at 18:03

Brad Solomon

13.4k73379

1

This is a very nice way to do it!
– jpp
Nov 20 at 18:03

add a comment |

up vote
1
down vote

You can filter before your dataframe via loc before groupby:

df = pd.DataFrame({'a' : [1,1,1,2,2,3,3,3],

                   'b' : [23,45,62,24,45,34,25,62]})



counts = df['b'].value_counts()



res = df.loc[df['b'].isin(counts[counts > 1].index)]

        .groupby('b', as_index=False)['a'].mean()



print(res)



    b    a

0  45  1.5

1  62  2.0

answered Nov 20 at 18:01

jpp

89.5k1952100

add a comment |

up vote
1
down vote

You were pretty close:

data.groupby('b').filter(lambda g:len(g)>1).groupby('b',as_index=False).mean()

results in the answer you were looking for:

edited Nov 20 at 18:08

Brad Solomon

13.4k73379

answered Nov 20 at 18:00

jedi

1016

Would you care to explain why using groupby twice does the trick?
– Vinícius Silva
Nov 20 at 18:15

The first group by combined with filter produces records where "b" appears more than once. On a reduced subset, we can then do group by on "b" to produce mean of "a" for each "b" group.
– jedi
Nov 20 at 19:17

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53398823%2ffilter-dataframe-to-duplicated-items-and-compute-groupwise-means-on-result%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
3
down vote

accepted

Grouping on Duplicates

You can do this with data['b'].duplicated(keep=False) to create a boolean mask first:

>>> data[data['b'].duplicated(keep=False)].groupby('b', as_index=False)['a'].mean()                                                                         

    b    a

0  45  1.5

1  62  2.0

data.b.duplicated(keep=False) marks all duplicated occurrences as True and lets you restrict output to those rows:

>>> data.b.duplicated(keep=False)                                                                                                                        

0    False

1     True

2     True

3    False

4     True

5    False

6    False

7     True

Name: b, dtype: bool



>>> data[data.b.duplicated(keep=False)]                                                                                                                  

   a   b

1  1  45

2  1  62

4  2  45

7  3  62

Update: Grouping by Arbitrary Number of Occurrences

Can this solution be generalized to look for an arbitrary number of occurrences? Let's say I wanted to calculate the mean only for values that occurred more than 5 times on the DataFrame.

In this scenario, you need to generate a boolean mask of the same shape as in the example above, but using a slightly different approach.

Here is one way:

>>> vc = data['b'].map(data['b'].value_counts(sort=False))

>>> vc                                                                                                                                                   



0    1

1    2

2    2

3    1

4    2

5    1

6    1

7    2

Name: b, dtype: int64

mask = vc == 2  # or > 5, in your case

data[mask].groupby('b', as_index=False)['a'].mean()

edited Nov 20 at 18:20

answered Nov 20 at 18:03

Brad Solomon

13.4k73379

1

This is a very nice way to do it!
– jpp
Nov 20 at 18:03

add a comment |

up vote
3
down vote

accepted

Grouping on Duplicates

You can do this with data['b'].duplicated(keep=False) to create a boolean mask first:

>>> data[data['b'].duplicated(keep=False)].groupby('b', as_index=False)['a'].mean()                                                                         

    b    a

0  45  1.5

1  62  2.0

data.b.duplicated(keep=False) marks all duplicated occurrences as True and lets you restrict output to those rows:

>>> data.b.duplicated(keep=False)                                                                                                                        

0    False

1     True

2     True

3    False

4     True

5    False

6    False

7     True

Name: b, dtype: bool



>>> data[data.b.duplicated(keep=False)]                                                                                                                  

   a   b

1  1  45

2  1  62

4  2  45

7  3  62

Update: Grouping by Arbitrary Number of Occurrences

Can this solution be generalized to look for an arbitrary number of occurrences? Let's say I wanted to calculate the mean only for values that occurred more than 5 times on the DataFrame.

In this scenario, you need to generate a boolean mask of the same shape as in the example above, but using a slightly different approach.

Here is one way:

>>> vc = data['b'].map(data['b'].value_counts(sort=False))

>>> vc                                                                                                                                                   



0    1

1    2

2    2

3    1

4    2

5    1

6    1

7    2

Name: b, dtype: int64

mask = vc == 2  # or > 5, in your case

data[mask].groupby('b', as_index=False)['a'].mean()

edited Nov 20 at 18:20

answered Nov 20 at 18:03

Brad Solomon

13.4k73379

1

This is a very nice way to do it!
– jpp
Nov 20 at 18:03

add a comment |

up vote
3
down vote

accepted

Grouping on Duplicates

You can do this with data['b'].duplicated(keep=False) to create a boolean mask first:

>>> data[data['b'].duplicated(keep=False)].groupby('b', as_index=False)['a'].mean()                                                                         

    b    a

0  45  1.5

1  62  2.0

data.b.duplicated(keep=False) marks all duplicated occurrences as True and lets you restrict output to those rows:

>>> data.b.duplicated(keep=False)                                                                                                                        

0    False

1     True

2     True

3    False

4     True

5    False

6    False

7     True

Name: b, dtype: bool



>>> data[data.b.duplicated(keep=False)]                                                                                                                  

   a   b

1  1  45

2  1  62

4  2  45

7  3  62

Update: Grouping by Arbitrary Number of Occurrences

Can this solution be generalized to look for an arbitrary number of occurrences? Let's say I wanted to calculate the mean only for values that occurred more than 5 times on the DataFrame.

In this scenario, you need to generate a boolean mask of the same shape as in the example above, but using a slightly different approach.

Here is one way:

>>> vc = data['b'].map(data['b'].value_counts(sort=False))

>>> vc                                                                                                                                                   



0    1

1    2

2    2

3    1

4    2

5    1

6    1

7    2

Name: b, dtype: int64

mask = vc == 2  # or > 5, in your case

data[mask].groupby('b', as_index=False)['a'].mean()

edited Nov 20 at 18:20

answered Nov 20 at 18:03

Brad Solomon

13.4k73379

Grouping on Duplicates

You can do this with data['b'].duplicated(keep=False) to create a boolean mask first:

>>> data[data['b'].duplicated(keep=False)].groupby('b', as_index=False)['a'].mean()                                                                         

    b    a

0  45  1.5

1  62  2.0

data.b.duplicated(keep=False) marks all duplicated occurrences as True and lets you restrict output to those rows:

>>> data.b.duplicated(keep=False)                                                                                                                        

0    False

1     True

2     True

3    False

4     True

5    False

6    False

7     True

Name: b, dtype: bool



>>> data[data.b.duplicated(keep=False)]                                                                                                                  

   a   b

1  1  45

2  1  62

4  2  45

7  3  62

Update: Grouping by Arbitrary Number of Occurrences

Can this solution be generalized to look for an arbitrary number of occurrences? Let's say I wanted to calculate the mean only for values that occurred more than 5 times on the DataFrame.

In this scenario, you need to generate a boolean mask of the same shape as in the example above, but using a slightly different approach.

Here is one way:

>>> vc = data['b'].map(data['b'].value_counts(sort=False))

>>> vc                                                                                                                                                   



0    1

1    2

2    2

3    1

4    2

5    1

6    1

7    2

Name: b, dtype: int64

mask = vc == 2  # or > 5, in your case

data[mask].groupby('b', as_index=False)['a'].mean()

edited Nov 20 at 18:20

answered Nov 20 at 18:03

Brad Solomon

13.4k73379

edited Nov 20 at 18:20

answered Nov 20 at 18:03

Brad Solomon

13.4k73379

answered Nov 20 at 18:03

Brad Solomon

13.4k73379

answered Nov 20 at 18:03

Brad Solomon

13.4k73379

1

This is a very nice way to do it!
– jpp
Nov 20 at 18:03

add a comment |

1

This is a very nice way to do it!
– jpp
Nov 20 at 18:03

This is a very nice way to do it!
– jpp
Nov 20 at 18:03

add a comment |

up vote
1
down vote

You can filter before your dataframe via loc before groupby:

df = pd.DataFrame({'a' : [1,1,1,2,2,3,3,3],

                   'b' : [23,45,62,24,45,34,25,62]})



counts = df['b'].value_counts()



res = df.loc[df['b'].isin(counts[counts > 1].index)]

        .groupby('b', as_index=False)['a'].mean()



print(res)



    b    a

0  45  1.5

1  62  2.0

answered Nov 20 at 18:01

jpp

89.5k1952100

add a comment |

up vote
1
down vote

You can filter before your dataframe via loc before groupby:

df = pd.DataFrame({'a' : [1,1,1,2,2,3,3,3],

                   'b' : [23,45,62,24,45,34,25,62]})



counts = df['b'].value_counts()



res = df.loc[df['b'].isin(counts[counts > 1].index)]

        .groupby('b', as_index=False)['a'].mean()



print(res)



    b    a

0  45  1.5

1  62  2.0

answered Nov 20 at 18:01

jpp

89.5k1952100

add a comment |

up vote
1
down vote

You can filter before your dataframe via loc before groupby:

df = pd.DataFrame({'a' : [1,1,1,2,2,3,3,3],

                   'b' : [23,45,62,24,45,34,25,62]})



counts = df['b'].value_counts()



res = df.loc[df['b'].isin(counts[counts > 1].index)]

        .groupby('b', as_index=False)['a'].mean()



print(res)



    b    a

0  45  1.5

1  62  2.0

answered Nov 20 at 18:01

jpp

89.5k1952100

You can filter before your dataframe via loc before groupby:

df = pd.DataFrame({'a' : [1,1,1,2,2,3,3,3],

                   'b' : [23,45,62,24,45,34,25,62]})



counts = df['b'].value_counts()



res = df.loc[df['b'].isin(counts[counts > 1].index)]

        .groupby('b', as_index=False)['a'].mean()



print(res)



    b    a

0  45  1.5

1  62  2.0

answered Nov 20 at 18:01

jpp

89.5k1952100

answered Nov 20 at 18:01

jpp

89.5k1952100

answered Nov 20 at 18:01

jpp

89.5k1952100

answered Nov 20 at 18:01

jpp

89.5k1952100

add a comment |

up vote
1
down vote

You were pretty close:

data.groupby('b').filter(lambda g:len(g)>1).groupby('b',as_index=False).mean()

results in the answer you were looking for:

edited Nov 20 at 18:08

Brad Solomon

13.4k73379

answered Nov 20 at 18:00

jedi

1016

Would you care to explain why using groupby twice does the trick?
– Vinícius Silva
Nov 20 at 18:15

The first group by combined with filter produces records where "b" appears more than once. On a reduced subset, we can then do group by on "b" to produce mean of "a" for each "b" group.
– jedi
Nov 20 at 19:17

add a comment |

up vote
1
down vote

You were pretty close:

data.groupby('b').filter(lambda g:len(g)>1).groupby('b',as_index=False).mean()

results in the answer you were looking for:

edited Nov 20 at 18:08

Brad Solomon

13.4k73379

answered Nov 20 at 18:00

jedi

1016

Would you care to explain why using groupby twice does the trick?
– Vinícius Silva
Nov 20 at 18:15

The first group by combined with filter produces records where "b" appears more than once. On a reduced subset, we can then do group by on "b" to produce mean of "a" for each "b" group.
– jedi
Nov 20 at 19:17

add a comment |

up vote
1
down vote

You were pretty close:

data.groupby('b').filter(lambda g:len(g)>1).groupby('b',as_index=False).mean()

results in the answer you were looking for:

edited Nov 20 at 18:08

Brad Solomon

13.4k73379

answered Nov 20 at 18:00

jedi

1016

You were pretty close:

data.groupby('b').filter(lambda g:len(g)>1).groupby('b',as_index=False).mean()

results in the answer you were looking for:

edited Nov 20 at 18:08

Brad Solomon

13.4k73379

answered Nov 20 at 18:00

jedi

1016

edited Nov 20 at 18:08

Brad Solomon

13.4k73379

edited Nov 20 at 18:08

Brad Solomon

13.4k73379

edited Nov 20 at 18:08

Brad Solomon

13.4k73379

answered Nov 20 at 18:00

jedi

1016

answered Nov 20 at 18:00

jedi

1016

answered Nov 20 at 18:00

jedi

1016

Would you care to explain why using groupby twice does the trick?
– Vinícius Silva
Nov 20 at 18:15

The first group by combined with filter produces records where "b" appears more than once. On a reduced subset, we can then do group by on "b" to produce mean of "a" for each "b" group.
– jedi
Nov 20 at 19:17

add a comment |

Would you care to explain why using groupby twice does the trick?
– Vinícius Silva
Nov 20 at 18:15

The first group by combined with filter produces records where "b" appears more than once. On a reduced subset, we can then do group by on "b" to produce mean of "a" for each "b" group.
– jedi
Nov 20 at 19:17

Would you care to explain why using groupby twice does the trick?
– Vinícius Silva
Nov 20 at 18:15

The first group by combined with filter produces records where "b" appears more than once. On a reduced subset, we can then do group by on "b" to produce mean of "a" for each "b" group.
– jedi
Nov 20 at 19:17

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

u56whwjLuT,E0UIZI FN3p,OD3O,a

搜尋此網誌

Tukukkk