Filter DataFrame to Duplicated Items and Compute Groupwise Means on Result











Ok, so here is what I'm trying to do:



I have a DataFrame like this:



import pandas as pd

data = pd.DataFrame(
    {'a': [1, 1, 1, 2, 2, 3, 3, 3],
     'b': [23, 45, 62, 24, 45, 34, 25, 62]})


I managed to calculate the mean of column 'a' grouped by column 'b' using the following code:



data.groupby('b', as_index=False)['a'].mean()


which resulted in a DataFrame like this:



    b    a
0  23  1.0
1  24  2.0
2  25  3.0
3  34  3.0
4  45  1.5
5  62  2.0



However, I'd like to calculate the mean only for the values of 'b' that occur more than once in the DataFrame, resulting in a DataFrame like this:



    b    a
0  45  1.5
1  62  2.0



I tried to do it by using the following line:



data.groupby('b', as_index=False).filter(lambda group: len(group)>1)['a'].mean()


but it results in the overall mean of rows 1, 2, 4 and 7 as a single number, which is obviously not what I want.
Can someone please help me obtain the desired DataFrame and tell me what I'm getting wrong in my use of the filter function?
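
For illustration, here is a minimal sketch (using the data DataFrame above) of what that line actually evaluates to: the filter step keeps only the duplicated rows, and the trailing ['a'].mean() then collapses them into one scalar.

# filter() keeps the rows belonging to 'b' groups with more than one member
kept = data.groupby('b', as_index=False).filter(lambda group: len(group) > 1)
print(kept)               # rows 1, 2, 4 and 7

# ...but ['a'].mean() then averages those rows' 'a' values into a single number
print(kept['a'].mean())   # 1.75, not a groupwise result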



Thank you!










python pandas dataframe pandas-groupby






asked Nov 20 at 17:56 by Vinícius Silva
edited Nov 20 at 18:07 by Brad Solomon

3 Answers


Accepted answer (Brad Solomon, answered Nov 20 at 18:03, edited Nov 20 at 18:20)

Grouping on Duplicates

You can do this with data['b'].duplicated(keep=False) to create a boolean mask first:

>>> data[data['b'].duplicated(keep=False)].groupby('b', as_index=False)['a'].mean()
    b    a
0  45  1.5
1  62  2.0

data.b.duplicated(keep=False) marks all duplicated occurrences as True and lets you restrict output to those rows:

>>> data.b.duplicated(keep=False)
0    False
1     True
2     True
3    False
4     True
5    False
6    False
7     True
Name: b, dtype: bool

>>> data[data.b.duplicated(keep=False)]
   a   b
1  1  45
2  1  62
4  2  45
7  3  62

Update: Grouping by an Arbitrary Number of Occurrences

Can this solution be generalized to look for an arbitrary number of occurrences? Let's say I wanted to calculate the mean only for values that occurred more than 5 times in the DataFrame.

In this scenario, you need to generate a boolean mask of the same shape as in the example above, but using a slightly different approach.

Here is one way:

>>> vc = data['b'].map(data['b'].value_counts(sort=False))
>>> vc
0    1
1    2
2    2
3    1
4    2
5    1
6    1
7    2
Name: b, dtype: int64

These are the element-wise counts for each value of b. To turn this into a mask (say you want means only for count == 2, which is the same as above in this example, but can be extended to any threshold):

mask = vc == 2  # or > 5, in your case
data[mask].groupby('b', as_index=False)['a'].mean()
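
As a side note, a roughly equivalent way to build the same per-row counts is groupby(...).transform('size'); the snippet below is only a sketch of that alternative, not something from the original answer:

# Per-row group sizes for column 'b' (same numbers as vc above)
sizes = data.groupby('b')['b'].transform('size')

# Keep rows whose 'b' value occurs more than once, then take the groupwise means
data[sizes > 1].groupby('b', as_index=False)['a'].mean()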





• This is a very nice way to do it! – jpp, Nov 20 at 18:03


Answer (jpp, answered Nov 20 at 18:01)

You can filter your dataframe via loc before the groupby:

df = pd.DataFrame({'a' : [1,1,1,2,2,3,3,3],
                   'b' : [23,45,62,24,45,34,25,62]})

counts = df['b'].value_counts()

res = df.loc[df['b'].isin(counts[counts > 1].index)]\
        .groupby('b', as_index=False)['a'].mean()

print(res)

    b    a
0  45  1.5
1  62  2.0
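
If you need this for several different thresholds, the same idea can be wrapped in a small helper; the function name and the min_count parameter below are illustrative only, not part of the original answer:

def mean_of_repeated(df, group_col, value_col, min_count=2):
    # Group means of value_col, restricted to group_col values that occur
    # at least min_count times (illustrative sketch of the approach above).
    counts = df[group_col].value_counts()
    keep = counts[counts >= min_count].index
    return (df.loc[df[group_col].isin(keep)]
              .groupby(group_col, as_index=False)[value_col].mean())

print(mean_of_repeated(df, 'b', 'a'))   # same result: b=45 -> 1.5, b=62 -> 2.0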





Answer (jedi, answered Nov 20 at 18:00, edited Nov 20 at 18:08 by Brad Solomon)

You were pretty close:

data.groupby('b').filter(lambda g: len(g) > 1).groupby('b', as_index=False).mean()

results in the answer you were looking for:

    b    a
0  45  1.5
1  62  2.0





• Would you care to explain why using groupby twice does the trick? – Vinícius Silva, Nov 20 at 18:15

• The first groupby combined with filter produces the records where "b" appears more than once. On that reduced subset, we can then group by "b" again to produce the mean of "a" for each "b" group. – jedi, Nov 20 at 19:17
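
To make that two-step logic concrete, here is the chain from the answer broken into intermediate steps (a sketch using the question's data):

# Step 1: groupby + filter keeps only rows whose 'b' group has more than one member
dups = data.groupby('b').filter(lambda g: len(g) > 1)
#    a   b
# 1  1  45
# 2  1  62
# 4  2  45
# 7  3  62

# Step 2: group the reduced frame by 'b' again and average each group
result = dups.groupby('b', as_index=False).mean()
#     b    a
# 0  45  1.5
# 1  62  2.0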










