pandas - filter on groups which have at least one column containing non-null values in a groupby












1















I have the following python pandas dataframe:



df = pd.DataFrame({'Id': ['1', '1', '1', '2', '2', '3'], 'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'], 'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456']})

>>> print(df)
Id A B C
0 1 TRUE NaN NaN
1 1 TRUE NaN NaN
2 1 TRUE abc NaN
3 2 TRUE NaN NaN
4 2 TRUE NaN NaN
5 3 FALSE def 456


I want to end up with the following dataframe:



>>> print(dfout)
Id A B C
0 1 TRUE abc NaN


The same Id value can appear on multiple rows. Each Id will either have the value TRUE or FALSE in column A consistently on all its rows. Columns B and C can have any value, including NaN.

I want one row in dfout for each Id that has A=TRUE and show the max value seen in columns B and C. But if the only values seen in columns B and C = NaN for all of an Id's rows, then that Id is to be excluded from dfout.




  • Id 1 has A=TRUE, and has B=abc in its third row, so it meets
    the requirements.

  • Id 2 has A=TRUE, but columns B and C are NaN for
    both its rows, so it does not.

  • Id 3 has A=FALSE, so it does not
    meet requirements.


I created a groupby df on Id, then applied a mask to only include rows with A=TRUE. But having trouble understanding how to remove the rows with NaN for all rows in columns B and C.



grouped = df.groupby(['Id'])
mask = grouped['A'].transform(lambda x: 'TRUE' == x.max()).astype(bool)
df.loc[mask].reset_index(drop=True)

Id A B C
0 1 TRUE NaN NaN
1 1 TRUE NaN NaN
2 1 TRUE abc NaN
3 2 TRUE NaN NaN
4 2 TRUE NaN NaN


Then I tried several things along the lines of:



df.loc[mask].reset_index(drop=True).all(['B'],['C']).isnull


But getting errors, like:




" TypeError: unhashable type: 'list' ".




Using python 3.6, pandas 0.23.0; Looked here for help: keep dataframe rows meeting a condition into each group of the same dataframe grouped by










share|improve this question





























    1















    I have the following python pandas dataframe:



    df = pd.DataFrame({'Id': ['1', '1', '1', '2', '2', '3'], 'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'], 'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456']})

    >>> print(df)
    Id A B C
    0 1 TRUE NaN NaN
    1 1 TRUE NaN NaN
    2 1 TRUE abc NaN
    3 2 TRUE NaN NaN
    4 2 TRUE NaN NaN
    5 3 FALSE def 456


    I want to end up with the following dataframe:



    >>> print(dfout)
    Id A B C
    0 1 TRUE abc NaN


    The same Id value can appear on multiple rows. Each Id will either have the value TRUE or FALSE in column A consistently on all its rows. Columns B and C can have any value, including NaN.

    I want one row in dfout for each Id that has A=TRUE and show the max value seen in columns B and C. But if the only values seen in columns B and C = NaN for all of an Id's rows, then that Id is to be excluded from dfout.




    • Id 1 has A=TRUE, and has B=abc in its third row, so it meets
      the requirements.

    • Id 2 has A=TRUE, but columns B and C are NaN for
      both its rows, so it does not.

    • Id 3 has A=FALSE, so it does not
      meet requirements.


    I created a groupby df on Id, then applied a mask to only include rows with A=TRUE. But having trouble understanding how to remove the rows with NaN for all rows in columns B and C.



    grouped = df.groupby(['Id'])
    mask = grouped['A'].transform(lambda x: 'TRUE' == x.max()).astype(bool)
    df.loc[mask].reset_index(drop=True)

    Id A B C
    0 1 TRUE NaN NaN
    1 1 TRUE NaN NaN
    2 1 TRUE abc NaN
    3 2 TRUE NaN NaN
    4 2 TRUE NaN NaN


    Then I tried several things along the lines of:



    df.loc[mask].reset_index(drop=True).all(['B'],['C']).isnull


    But getting errors, like:




    " TypeError: unhashable type: 'list' ".




    Using python 3.6, pandas 0.23.0; Looked here for help: keep dataframe rows meeting a condition into each group of the same dataframe grouped by










    share|improve this question



























      1












      1








      1








      I have the following python pandas dataframe:



      df = pd.DataFrame({'Id': ['1', '1', '1', '2', '2', '3'], 'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'], 'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456']})

      >>> print(df)
      Id A B C
      0 1 TRUE NaN NaN
      1 1 TRUE NaN NaN
      2 1 TRUE abc NaN
      3 2 TRUE NaN NaN
      4 2 TRUE NaN NaN
      5 3 FALSE def 456


      I want to end up with the following dataframe:



      >>> print(dfout)
      Id A B C
      0 1 TRUE abc NaN


      The same Id value can appear on multiple rows. Each Id will either have the value TRUE or FALSE in column A consistently on all its rows. Columns B and C can have any value, including NaN.

      I want one row in dfout for each Id that has A=TRUE and show the max value seen in columns B and C. But if the only values seen in columns B and C = NaN for all of an Id's rows, then that Id is to be excluded from dfout.




      • Id 1 has A=TRUE, and has B=abc in its third row, so it meets
        the requirements.

      • Id 2 has A=TRUE, but columns B and C are NaN for
        both its rows, so it does not.

      • Id 3 has A=FALSE, so it does not
        meet requirements.


      I created a groupby df on Id, then applied a mask to only include rows with A=TRUE. But having trouble understanding how to remove the rows with NaN for all rows in columns B and C.



      grouped = df.groupby(['Id'])
      mask = grouped['A'].transform(lambda x: 'TRUE' == x.max()).astype(bool)
      df.loc[mask].reset_index(drop=True)

      Id A B C
      0 1 TRUE NaN NaN
      1 1 TRUE NaN NaN
      2 1 TRUE abc NaN
      3 2 TRUE NaN NaN
      4 2 TRUE NaN NaN


      Then I tried several things along the lines of:



      df.loc[mask].reset_index(drop=True).all(['B'],['C']).isnull


      But getting errors, like:




      " TypeError: unhashable type: 'list' ".




      Using python 3.6, pandas 0.23.0; Looked here for help: keep dataframe rows meeting a condition into each group of the same dataframe grouped by










      share|improve this question
















      I have the following python pandas dataframe:



      df = pd.DataFrame({'Id': ['1', '1', '1', '2', '2', '3'], 'A': ['TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FALSE'], 'B': [np.nan, np.nan, 'abc', np.nan, np.nan, 'def'],'C': [np.nan, np.nan, np.nan, np.nan, np.nan, '456']})

      >>> print(df)
      Id A B C
      0 1 TRUE NaN NaN
      1 1 TRUE NaN NaN
      2 1 TRUE abc NaN
      3 2 TRUE NaN NaN
      4 2 TRUE NaN NaN
      5 3 FALSE def 456


      I want to end up with the following dataframe:



      >>> print(dfout)
      Id A B C
      0 1 TRUE abc NaN


      The same Id value can appear on multiple rows. Each Id will either have the value TRUE or FALSE in column A consistently on all its rows. Columns B and C can have any value, including NaN.

      I want one row in dfout for each Id that has A=TRUE and show the max value seen in columns B and C. But if the only values seen in columns B and C = NaN for all of an Id's rows, then that Id is to be excluded from dfout.




      • Id 1 has A=TRUE, and has B=abc in its third row, so it meets
        the requirements.

      • Id 2 has A=TRUE, but columns B and C are NaN for
        both its rows, so it does not.

      • Id 3 has A=FALSE, so it does not
        meet requirements.


      I created a groupby df on Id, then applied a mask to only include rows with A=TRUE. But having trouble understanding how to remove the rows with NaN for all rows in columns B and C.



      grouped = df.groupby(['Id'])
      mask = grouped['A'].transform(lambda x: 'TRUE' == x.max()).astype(bool)
      df.loc[mask].reset_index(drop=True)

      Id A B C
      0 1 TRUE NaN NaN
      1 1 TRUE NaN NaN
      2 1 TRUE abc NaN
      3 2 TRUE NaN NaN
      4 2 TRUE NaN NaN


      Then I tried several things along the lines of:



      df.loc[mask].reset_index(drop=True).all(['B'],['C']).isnull


      But getting errors, like:




      " TypeError: unhashable type: 'list' ".




      Using python 3.6, pandas 0.23.0; Looked here for help: keep dataframe rows meeting a condition into each group of the same dataframe grouped by







      python pandas filter






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Dec 7 '18 at 1:33







      BilboC

















      asked Nov 22 '18 at 20:25









      BilboCBilboC

      247




      247
























          1 Answer
          1






          active

          oldest

          votes


















          1














          The solution has three parts to it.




          1. Filter dataframe to keep rows where column A is True

          2. Groupby Id and use first which will return first not null value


          3. Use dropna on the resulting dataframe on columns B and C and how = all



            df.loc[df['A'] == True].groupby('Id', as_index = False).first().dropna(subset = ['B', 'C'], how = 'all')



                Id  A       B   C
            0 1 True abc NaN







          share|improve this answer



















          • 1





            perfect - very clean; i modified the bit to df['A'] == 'TRUE' for my example

            – BilboC
            Nov 22 '18 at 21:01













          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53437636%2fpandas-filter-on-groups-which-have-at-least-one-column-containing-non-null-val%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1














          The solution has three parts to it.




          1. Filter dataframe to keep rows where column A is True

          2. Groupby Id and use first which will return first not null value


          3. Use dropna on the resulting dataframe on columns B and C and how = all



            df.loc[df['A'] == True].groupby('Id', as_index = False).first().dropna(subset = ['B', 'C'], how = 'all')



                Id  A       B   C
            0 1 True abc NaN







          share|improve this answer



















          • 1





            perfect - very clean; i modified the bit to df['A'] == 'TRUE' for my example

            – BilboC
            Nov 22 '18 at 21:01


















          1














          The solution has three parts to it.




          1. Filter dataframe to keep rows where column A is True

          2. Groupby Id and use first which will return first not null value


          3. Use dropna on the resulting dataframe on columns B and C and how = all



            df.loc[df['A'] == True].groupby('Id', as_index = False).first().dropna(subset = ['B', 'C'], how = 'all')



                Id  A       B   C
            0 1 True abc NaN







          share|improve this answer



















          • 1





            perfect - very clean; i modified the bit to df['A'] == 'TRUE' for my example

            – BilboC
            Nov 22 '18 at 21:01
















          1












          1








          1







          The solution has three parts to it.




          1. Filter dataframe to keep rows where column A is True

          2. Groupby Id and use first which will return first not null value


          3. Use dropna on the resulting dataframe on columns B and C and how = all



            df.loc[df['A'] == True].groupby('Id', as_index = False).first().dropna(subset = ['B', 'C'], how = 'all')



                Id  A       B   C
            0 1 True abc NaN







          share|improve this answer













          The solution has three parts to it.




          1. Filter dataframe to keep rows where column A is True

          2. Groupby Id and use first which will return first not null value


          3. Use dropna on the resulting dataframe on columns B and C and how = all



            df.loc[df['A'] == True].groupby('Id', as_index = False).first().dropna(subset = ['B', 'C'], how = 'all')



                Id  A       B   C
            0 1 True abc NaN








          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 22 '18 at 20:42









          VaishaliVaishali

          19.2k41030




          19.2k41030








          • 1





            perfect - very clean; i modified the bit to df['A'] == 'TRUE' for my example

            – BilboC
            Nov 22 '18 at 21:01
















          • 1





            perfect - very clean; i modified the bit to df['A'] == 'TRUE' for my example

            – BilboC
            Nov 22 '18 at 21:01










          1




          1





          perfect - very clean; i modified the bit to df['A'] == 'TRUE' for my example

          – BilboC
          Nov 22 '18 at 21:01







          perfect - very clean; i modified the bit to df['A'] == 'TRUE' for my example

          – BilboC
          Nov 22 '18 at 21:01




















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53437636%2fpandas-filter-on-groups-which-have-at-least-one-column-containing-non-null-val%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          404 Error Contact Form 7 ajax form submitting

          How to know if a Active Directory user can login interactively

          Refactoring coordinates for Minecraft Pi buildings written in Python