How do I unnest (explode) a column in a pandas DataFrame?












31














I have the following DataFrame where one of the columns is an object (list type cell):



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]:
A B
0 1 [1, 2]
1 2 [1, 2]


My expected output is:



   A  B
0 1 1
1 1 2
3 2 1
4 2 2


What should I do to achieve this?










share|improve this question




















  • 2




    Related, unnesting strings: stackoverflow.com/q/48197234/4909087
    – coldspeed
    Nov 12 '18 at 12:00










  • Couple of related posts: here, here, here, here, ...
    – Cleb
    2 days ago
















31














I have the following DataFrame where one of the columns is an object (list type cell):



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]:
A B
0 1 [1, 2]
1 2 [1, 2]


My expected output is:



   A  B
0 1 1
1 1 2
3 2 1
4 2 2


What should I do to achieve this?










share|improve this question




















  • 2




    Related, unnesting strings: stackoverflow.com/q/48197234/4909087
    – coldspeed
    Nov 12 '18 at 12:00










  • Couple of related posts: here, here, here, here, ...
    – Cleb
    2 days ago














31












31








31


11





I have the following DataFrame where one of the columns is an object (list type cell):



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]:
A B
0 1 [1, 2]
1 2 [1, 2]


My expected output is:



   A  B
0 1 1
1 1 2
3 2 1
4 2 2


What should I do to achieve this?










share|improve this question















I have the following DataFrame where one of the columns is an object (list type cell):



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[1,2]]})
df
Out[458]:
A B
0 1 [1, 2]
1 2 [1, 2]


My expected output is:



   A  B
0 1 1
1 1 2
3 2 1
4 2 2


What should I do to achieve this?







python pandas dataframe






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Dec 10 '18 at 15:06

























asked Nov 9 '18 at 2:19









W-B

102k73163




102k73163








  • 2




    Related, unnesting strings: stackoverflow.com/q/48197234/4909087
    – coldspeed
    Nov 12 '18 at 12:00










  • Couple of related posts: here, here, here, here, ...
    – Cleb
    2 days ago














  • 2




    Related, unnesting strings: stackoverflow.com/q/48197234/4909087
    – coldspeed
    Nov 12 '18 at 12:00










  • Couple of related posts: here, here, here, here, ...
    – Cleb
    2 days ago








2




2




Related, unnesting strings: stackoverflow.com/q/48197234/4909087
– coldspeed
Nov 12 '18 at 12:00




Related, unnesting strings: stackoverflow.com/q/48197234/4909087
– coldspeed
Nov 12 '18 at 12:00












Couple of related posts: here, here, here, here, ...
– Cleb
2 days ago




Couple of related posts: here, here, here, here, ...
– Cleb
2 days ago












5 Answers
5






active

oldest

votes


















32





+100











As an user with both R and python, I have seen this type of question a couple of times.





In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





Method 1
apply + pd.Series (easy to understand but in terms of performance not recommended . )



df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2




Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2


Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



Solution : join or merge with the index after 'unnest' the single columns



s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B',1),how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2


If you need the column order exactly the same as before, add reindex at the end.



s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




Method 3 recreate the list



pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2


If more than two columns



s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]




Method 4 using reindex or loc



df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


Method 5 when the list only contains unique values:



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2


Method 6 using numpy for high performance:



newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2




Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2




Special case (two columns type object)



df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]




Self-def function



def unnesting(df, explode):
idx=df.index.repeat(df[explode[0]].str.len())
df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
df1.index=idx
return df1.join(df.drop(explode,1),how='left')

unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2




Summary :



I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.






share|improve this answer































    14





    +100









    Option 1



    If all of the sublists in the other column are the same length, numpy can be an efficient option here:



    vals = np.array(df.B.values.tolist())    
    a = np.repeat(df.A, vals.shape[1])

    pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




       A  B
    0 1 1
    1 1 2
    2 2 1
    3 2 2




    Option 2



    If the sublists have different length, you need an additional step:



    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A, rs)

    pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




       A  B
    0 1 1
    1 1 2
    2 2 1
    3 2 2




    Option 3



    I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



    df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
    'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




       A          B          C  D
    0 1 [1, 2] [1, 2, 3] A
    1 2 [1, 2, 3] [1, 2] B
    2 3 [1] [1, 2] C




    def unnest(df, tile, explode):
    vals = df[explode].sum(1)
    rs = [len(r) for r in vals]
    a = np.repeat(df[tile].values, rs, axis=0)
    b = np.concatenate(vals.values)
    d = np.column_stack((a, b))
    return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

    unnest(df, ['A', 'D'], ['B', 'C'])




        A  D B_C
    0 1 A 1
    1 1 A 2
    2 1 A 1
    3 1 A 2
    4 1 A 3
    5 2 B 1
    6 2 B 2
    7 2 B 3
    8 2 B 1
    9 2 B 2
    10 3 C 1
    11 3 C 1
    12 3 C 2




    Functions



    def wen1(df):
    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

    def wen2(df):
    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

    def wen3(df):
    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
    return s.join(df.drop('B', 1), how='left')

    def wen4(df):
    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

    def chris1(df):
    vals = np.array(df.B.values.tolist())
    a = np.repeat(df.A, vals.shape[1])
    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

    def chris2(df):
    vals = df.B.values.tolist()
    rs = [len(r) for r in vals]
    a = np.repeat(df.A.values, rs)
    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


    Timings



    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from timeit import timeit

    res = pd.DataFrame(
    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
    columns=[10, 50, 100, 500, 1000, 5000, 10000],
    dtype=float
    )

    for f in res.index:
    for c in res.columns:
    df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
    df = pd.concat([df]*c)
    stmt = '{}(df)'.format(f)
    setp = 'from __main__ import df, {}'.format(f)
    res.at[f, c] = timeit(stmt, setp, number=50)

    ax = res.div(res.min()).T.plot(loglog=True)
    ax.set_xlabel("N")
    ax.set_ylabel("time (relative)")


    Performance



    enter image description here






    share|improve this answer































      7














      One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



      import numpy as np
      import pandas as pd


      def unnest(frame, explode):
      def mesh(values):
      return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

      data = np.vstack(mesh(row) for row in frame[explode].values)
      return pd.DataFrame(data=data, columns=explode)


      df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
      print(unnest(df, ['A', 'B'])) # base
      print()

      df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
      print(unnest(df, ['A', 'B', 'C'])) # multiple columns
      print()

      df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
      'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

      print(unnest(df, ['A', 'B'])) # uneven length lists
      print()
      print(unnest(df, ['D', 'B'])) # different types
      print()


      Output



         A  B
      0 1 1
      1 1 2
      2 2 1
      3 2 2

      A B C
      0 1 1 1
      1 1 2 1
      2 1 1 2
      3 1 2 2
      4 2 3 3
      5 2 4 3
      6 2 3 4
      7 2 4 4

      A B
      0 1 1
      1 1 2
      2 2 1
      3 2 2
      4 2 3
      5 3 1

      D B
      0 A 1
      1 A 2
      2 B 1
      3 B 2
      4 B 3
      5 C 1





      share|improve this answer





























        1














        My 5 cents:



        df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

        df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
        ignore_index=True)


        and another 5



        df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

        (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
        .reset_index(level=1, drop=True)
        .reset_index())


        both resulting in the same



           A  B
        0 1 1
        1 2 1
        2 1 2
        3 2 2





        share|improve this answer































          -1














          Something pretty not recommended (at least work in this case):



          df=pd.concat([df]*2).sort_index()
          it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
          df['B']=df['B'].apply(lambda x:next(it))


          concat + sort_index + iter + apply + next.



          Now:



          print(df)


          Is:



             A  B
          0 1 1
          0 1 2
          1 2 1
          1 2 2


          If care about index:



          df=df.reset_index(drop=True)


          Now:



          print(df)


          Is:



             A  B
          0 1 1
          1 1 2
          2 2 1
          3 2 2





          share|improve this answer





















            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53218931%2fhow-do-i-unnest-explode-a-column-in-a-pandas-dataframe%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            5 Answers
            5






            active

            oldest

            votes








            5 Answers
            5






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            32





            +100











            As an user with both R and python, I have seen this type of question a couple of times.





            In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





            I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





            Method 1
            apply + pd.Series (easy to understand but in terms of performance not recommended . )



            df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
            Out[463]:
            A B
            0 1 1
            1 1 2
            0 2 1
            1 2 2




            Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



            df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
            df
            Out[465]:
            A B
            0 1 1
            0 1 2
            1 2 1
            1 2 2


            Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



            Solution : join or merge with the index after 'unnest' the single columns



            s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
            s.join(df.drop('B',1),how='left')
            Out[477]:
            B A
            0 1 1
            0 2 1
            1 1 2
            1 2 2


            If you need the column order exactly the same as before, add reindex at the end.



            s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




            Method 3 recreate the list



            pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
            Out[488]:
            A B
            0 1 1
            1 1 2
            2 2 1
            3 2 2


            If more than two columns



            s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
            s.merge(df,left_on=0,right_index=True)
            Out[491]:
            0 1 A B
            0 0 1 1 [1, 2]
            1 0 2 1 [1, 2]
            2 1 1 2 [1, 2]
            3 1 2 2 [1, 2]




            Method 4 using reindex or loc



            df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
            Out[554]:
            A B
            0 1 1
            0 1 2
            1 2 1
            1 2 2

            #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


            Method 5 when the list only contains unique values:



            df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
            from collections import ChainMap
            d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
            pd.DataFrame(list(d.items()),columns=df.columns[::-1])
            Out[574]:
            B A
            0 1 1
            1 2 1
            2 3 2
            3 4 2


            Method 6 using numpy for high performance:



            newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
            pd.DataFrame(data=newvalues[0],columns=df.columns)
            A B
            0 1 1
            1 1 2
            2 2 1
            3 2 2




            Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



            from itertools import cycle,chain
            l=df.values.tolist()
            l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
            pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
            A B
            0 1 1
            1 1 2
            2 2 1
            3 2 2




            Special case (two columns type object)



            df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
            df
            Out[592]:
            A B C
            0 1 [1, 2] [1, 2]
            1 2 [3, 4] [3, 4]




            Self-def function



            def unnesting(df, explode):
            idx=df.index.repeat(df[explode[0]].str.len())
            df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
            df1.index=idx
            return df1.join(df.drop(explode,1),how='left')

            unnesting(df,['B','C'])
            Out[609]:
            B C A
            0 1 1 1
            0 2 2 1
            1 3 3 2
            1 4 4 2




            Summary :



            I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.






            share|improve this answer




























              32





              +100











              As an user with both R and python, I have seen this type of question a couple of times.





              In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





              I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





              Method 1
              apply + pd.Series (easy to understand but in terms of performance not recommended . )



              df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
              Out[463]:
              A B
              0 1 1
              1 1 2
              0 2 1
              1 2 2




              Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



              df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
              df
              Out[465]:
              A B
              0 1 1
              0 1 2
              1 2 1
              1 2 2


              Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



              Solution : join or merge with the index after 'unnest' the single columns



              s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
              s.join(df.drop('B',1),how='left')
              Out[477]:
              B A
              0 1 1
              0 2 1
              1 1 2
              1 2 2


              If you need the column order exactly the same as before, add reindex at the end.



              s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




              Method 3 recreate the list



              pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
              Out[488]:
              A B
              0 1 1
              1 1 2
              2 2 1
              3 2 2


              If more than two columns



              s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
              s.merge(df,left_on=0,right_index=True)
              Out[491]:
              0 1 A B
              0 0 1 1 [1, 2]
              1 0 2 1 [1, 2]
              2 1 1 2 [1, 2]
              3 1 2 2 [1, 2]




              Method 4 using reindex or loc



              df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
              Out[554]:
              A B
              0 1 1
              0 1 2
              1 2 1
              1 2 2

              #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


              Method 5 when the list only contains unique values:



              df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
              from collections import ChainMap
              d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
              pd.DataFrame(list(d.items()),columns=df.columns[::-1])
              Out[574]:
              B A
              0 1 1
              1 2 1
              2 3 2
              3 4 2


              Method 6 using numpy for high performance:



              newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
              pd.DataFrame(data=newvalues[0],columns=df.columns)
              A B
              0 1 1
              1 1 2
              2 2 1
              3 2 2




              Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



              from itertools import cycle,chain
              l=df.values.tolist()
              l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
              pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
              A B
              0 1 1
              1 1 2
              2 2 1
              3 2 2




              Special case (two columns type object)



              df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
              df
              Out[592]:
              A B C
              0 1 [1, 2] [1, 2]
              1 2 [3, 4] [3, 4]




              Self-def function



              def unnesting(df, explode):
              idx=df.index.repeat(df[explode[0]].str.len())
              df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
              df1.index=idx
              return df1.join(df.drop(explode,1),how='left')

              unnesting(df,['B','C'])
              Out[609]:
              B C A
              0 1 1 1
              0 2 2 1
              1 3 3 2
              1 4 4 2




              Summary :



              I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.






              share|improve this answer


























                32





                +100







                32





                +100



                32




                +100






                As an user with both R and python, I have seen this type of question a couple of times.





                In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





                I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





                Method 1
                apply + pd.Series (easy to understand but in terms of performance not recommended . )



                df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
                Out[463]:
                A B
                0 1 1
                1 1 2
                0 2 1
                1 2 2




                Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



                df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
                df
                Out[465]:
                A B
                0 1 1
                0 1 2
                1 2 1
                1 2 2


                Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



                Solution : join or merge with the index after 'unnest' the single columns



                s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
                s.join(df.drop('B',1),how='left')
                Out[477]:
                B A
                0 1 1
                0 2 1
                1 1 2
                1 2 2


                If you need the column order exactly the same as before, add reindex at the end.



                s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




                Method 3 recreate the list



                pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
                Out[488]:
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2


                If more than two columns



                s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
                s.merge(df,left_on=0,right_index=True)
                Out[491]:
                0 1 A B
                0 0 1 1 [1, 2]
                1 0 2 1 [1, 2]
                2 1 1 2 [1, 2]
                3 1 2 2 [1, 2]




                Method 4 using reindex or loc



                df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
                Out[554]:
                A B
                0 1 1
                0 1 2
                1 2 1
                1 2 2

                #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


                Method 5 when the list only contains unique values:



                df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
                from collections import ChainMap
                d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
                pd.DataFrame(list(d.items()),columns=df.columns[::-1])
                Out[574]:
                B A
                0 1 1
                1 2 1
                2 3 2
                3 4 2


                Method 6 using numpy for high performance:



                newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
                pd.DataFrame(data=newvalues[0],columns=df.columns)
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2




                Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



                from itertools import cycle,chain
                l=df.values.tolist()
                l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
                pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2




                Special case (two columns type object)



                df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
                df
                Out[592]:
                A B C
                0 1 [1, 2] [1, 2]
                1 2 [3, 4] [3, 4]




                Self-def function



                def unnesting(df, explode):
                idx=df.index.repeat(df[explode[0]].str.len())
                df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
                df1.index=idx
                return df1.join(df.drop(explode,1),how='left')

                unnesting(df,['B','C'])
                Out[609]:
                B C A
                0 1 1 1
                0 2 2 1
                1 3 3 2
                1 4 4 2




                Summary :



                I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.






                share|improve this answer
















                As an user with both R and python, I have seen this type of question a couple of times.





                In R, they have the built-in function from package tidyr called unnest. But in Python(pandas) there is no built-in function for this type of question.





                I know object columns type always make the data hard to convert with a pandas' function. When I received the data like this , the first thing that came to mind was to 'flatten' or unnest the columns .





                Method 1
                apply + pd.Series (easy to understand but in terms of performance not recommended . )



                df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
                Out[463]:
                A B
                0 1 1
                1 1 2
                0 2 1
                1 2 2




                Method 2 using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )



                df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
                df
                Out[465]:
                A B
                0 1 1
                0 1 2
                1 2 1
                1 2 2


                Method 2.1 for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .



                Solution : join or merge with the index after 'unnest' the single columns



                s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
                s.join(df.drop('B',1),how='left')
                Out[477]:
                B A
                0 1 1
                0 2 1
                1 1 2
                1 2 2


                If you need the column order exactly the same as before, add reindex at the end.



                s.join(df.drop('B',1),how='left').reindex(columns=df.columns)




                Method 3 recreate the list



                pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
                Out[488]:
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2


                If more than two columns



                s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
                s.merge(df,left_on=0,right_index=True)
                Out[491]:
                0 1 A B
                0 0 1 1 [1, 2]
                1 0 2 1 [1, 2]
                2 1 1 2 [1, 2]
                3 1 2 2 [1, 2]




                Method 4 using reindex or loc



                df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
                Out[554]:
                A B
                0 1 1
                0 1 2
                1 2 1
                1 2 2

                #df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))


                Method 5 when the list only contains unique values:



                df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
                from collections import ChainMap
                d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
                pd.DataFrame(list(d.items()),columns=df.columns[::-1])
                Out[574]:
                B A
                0 1 1
                1 2 1
                2 3 2
                3 4 2


                Method 6 using numpy for high performance:



                newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
                pd.DataFrame(data=newvalues[0],columns=df.columns)
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2




                Method 7 : using base function itertools cycle and chain: Pure python solution just for fun



                from itertools import cycle,chain
                l=df.values.tolist()
                l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
                pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
                A B
                0 1 1
                1 1 2
                2 2 1
                3 2 2




                Special case (two columns type object)



                df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
                df
                Out[592]:
                A B C
                0 1 [1, 2] [1, 2]
                1 2 [3, 4] [3, 4]




                Self-def function



                def unnesting(df, explode):
                idx=df.index.repeat(df[explode[0]].str.len())
                df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
                df1.index=idx
                return df1.join(df.drop(explode,1),how='left')

                unnesting(df,['B','C'])
                Out[609]:
                B C A
                0 1 1 1
                0 2 2 1
                1 3 3 2
                1 4 4 2




                Summary :



                I am using pandas and python functions for this type of question. If you are worried about the speed of the above solutions, check user3483203's answer , since he is using numpy and most of the time numpy is faster . I recommend Cpython and numba if speed matters in your case.







                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Dec 26 '18 at 15:24









                Jack Moody

                442419




                442419










                answered Nov 9 '18 at 2:20









                W-B

                102k73163




                102k73163

























                    14





                    +100









                    Option 1



                    If all of the sublists in the other column are the same length, numpy can be an efficient option here:



                    vals = np.array(df.B.values.tolist())    
                    a = np.repeat(df.A, vals.shape[1])

                    pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




                       A  B
                    0 1 1
                    1 1 2
                    2 2 1
                    3 2 2




                    Option 2



                    If the sublists have different length, you need an additional step:



                    vals = df.B.values.tolist()
                    rs = [len(r) for r in vals]
                    a = np.repeat(df.A, rs)

                    pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




                       A  B
                    0 1 1
                    1 1 2
                    2 2 1
                    3 2 2




                    Option 3



                    I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



                    df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                    'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




                       A          B          C  D
                    0 1 [1, 2] [1, 2, 3] A
                    1 2 [1, 2, 3] [1, 2] B
                    2 3 [1] [1, 2] C




                    def unnest(df, tile, explode):
                    vals = df[explode].sum(1)
                    rs = [len(r) for r in vals]
                    a = np.repeat(df[tile].values, rs, axis=0)
                    b = np.concatenate(vals.values)
                    d = np.column_stack((a, b))
                    return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

                    unnest(df, ['A', 'D'], ['B', 'C'])




                        A  D B_C
                    0 1 A 1
                    1 1 A 2
                    2 1 A 1
                    3 1 A 2
                    4 1 A 3
                    5 2 B 1
                    6 2 B 2
                    7 2 B 3
                    8 2 B 1
                    9 2 B 2
                    10 3 C 1
                    11 3 C 1
                    12 3 C 2




                    Functions



                    def wen1(df):
                    return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

                    def wen2(df):
                    return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

                    def wen3(df):
                    s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
                    return s.join(df.drop('B', 1), how='left')

                    def wen4(df):
                    return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

                    def chris1(df):
                    vals = np.array(df.B.values.tolist())
                    a = np.repeat(df.A, vals.shape[1])
                    return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

                    def chris2(df):
                    vals = df.B.values.tolist()
                    rs = [len(r) for r in vals]
                    a = np.repeat(df.A.values, rs)
                    return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


                    Timings



                    import pandas as pd
                    import matplotlib.pyplot as plt
                    import numpy as np
                    from timeit import timeit

                    res = pd.DataFrame(
                    index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
                    columns=[10, 50, 100, 500, 1000, 5000, 10000],
                    dtype=float
                    )

                    for f in res.index:
                    for c in res.columns:
                    df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                    df = pd.concat([df]*c)
                    stmt = '{}(df)'.format(f)
                    setp = 'from __main__ import df, {}'.format(f)
                    res.at[f, c] = timeit(stmt, setp, number=50)

                    ax = res.div(res.min()).T.plot(loglog=True)
                    ax.set_xlabel("N")
                    ax.set_ylabel("time (relative)")


                    Performance



                    enter image description here






                    share|improve this answer




























                      14





                      +100









                      Option 1



                      If all of the sublists in the other column are the same length, numpy can be an efficient option here:



                      vals = np.array(df.B.values.tolist())    
                      a = np.repeat(df.A, vals.shape[1])

                      pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




                         A  B
                      0 1 1
                      1 1 2
                      2 2 1
                      3 2 2




                      Option 2



                      If the sublists have different length, you need an additional step:



                      vals = df.B.values.tolist()
                      rs = [len(r) for r in vals]
                      a = np.repeat(df.A, rs)

                      pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




                         A  B
                      0 1 1
                      1 1 2
                      2 2 1
                      3 2 2




                      Option 3



                      I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



                      df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                      'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




                         A          B          C  D
                      0 1 [1, 2] [1, 2, 3] A
                      1 2 [1, 2, 3] [1, 2] B
                      2 3 [1] [1, 2] C




                      def unnest(df, tile, explode):
                      vals = df[explode].sum(1)
                      rs = [len(r) for r in vals]
                      a = np.repeat(df[tile].values, rs, axis=0)
                      b = np.concatenate(vals.values)
                      d = np.column_stack((a, b))
                      return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

                      unnest(df, ['A', 'D'], ['B', 'C'])




                          A  D B_C
                      0 1 A 1
                      1 1 A 2
                      2 1 A 1
                      3 1 A 2
                      4 1 A 3
                      5 2 B 1
                      6 2 B 2
                      7 2 B 3
                      8 2 B 1
                      9 2 B 2
                      10 3 C 1
                      11 3 C 1
                      12 3 C 2




                      Functions



                      def wen1(df):
                      return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

                      def wen2(df):
                      return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

                      def wen3(df):
                      s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
                      return s.join(df.drop('B', 1), how='left')

                      def wen4(df):
                      return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

                      def chris1(df):
                      vals = np.array(df.B.values.tolist())
                      a = np.repeat(df.A, vals.shape[1])
                      return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

                      def chris2(df):
                      vals = df.B.values.tolist()
                      rs = [len(r) for r in vals]
                      a = np.repeat(df.A.values, rs)
                      return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


                      Timings



                      import pandas as pd
                      import matplotlib.pyplot as plt
                      import numpy as np
                      from timeit import timeit

                      res = pd.DataFrame(
                      index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
                      columns=[10, 50, 100, 500, 1000, 5000, 10000],
                      dtype=float
                      )

                      for f in res.index:
                      for c in res.columns:
                      df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                      df = pd.concat([df]*c)
                      stmt = '{}(df)'.format(f)
                      setp = 'from __main__ import df, {}'.format(f)
                      res.at[f, c] = timeit(stmt, setp, number=50)

                      ax = res.div(res.min()).T.plot(loglog=True)
                      ax.set_xlabel("N")
                      ax.set_ylabel("time (relative)")


                      Performance



                      enter image description here






                      share|improve this answer


























                        14





                        +100







                        14





                        +100



                        14




                        +100




                        Option 1



                        If all of the sublists in the other column are the same length, numpy can be an efficient option here:



                        vals = np.array(df.B.values.tolist())    
                        a = np.repeat(df.A, vals.shape[1])

                        pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




                           A  B
                        0 1 1
                        1 1 2
                        2 2 1
                        3 2 2




                        Option 2



                        If the sublists have different length, you need an additional step:



                        vals = df.B.values.tolist()
                        rs = [len(r) for r in vals]
                        a = np.repeat(df.A, rs)

                        pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




                           A  B
                        0 1 1
                        1 1 2
                        2 2 1
                        3 2 2




                        Option 3



                        I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



                        df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                        'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




                           A          B          C  D
                        0 1 [1, 2] [1, 2, 3] A
                        1 2 [1, 2, 3] [1, 2] B
                        2 3 [1] [1, 2] C




                        def unnest(df, tile, explode):
                        vals = df[explode].sum(1)
                        rs = [len(r) for r in vals]
                        a = np.repeat(df[tile].values, rs, axis=0)
                        b = np.concatenate(vals.values)
                        d = np.column_stack((a, b))
                        return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

                        unnest(df, ['A', 'D'], ['B', 'C'])




                            A  D B_C
                        0 1 A 1
                        1 1 A 2
                        2 1 A 1
                        3 1 A 2
                        4 1 A 3
                        5 2 B 1
                        6 2 B 2
                        7 2 B 3
                        8 2 B 1
                        9 2 B 2
                        10 3 C 1
                        11 3 C 1
                        12 3 C 2




                        Functions



                        def wen1(df):
                        return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

                        def wen2(df):
                        return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

                        def wen3(df):
                        s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
                        return s.join(df.drop('B', 1), how='left')

                        def wen4(df):
                        return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

                        def chris1(df):
                        vals = np.array(df.B.values.tolist())
                        a = np.repeat(df.A, vals.shape[1])
                        return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

                        def chris2(df):
                        vals = df.B.values.tolist()
                        rs = [len(r) for r in vals]
                        a = np.repeat(df.A.values, rs)
                        return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


                        Timings



                        import pandas as pd
                        import matplotlib.pyplot as plt
                        import numpy as np
                        from timeit import timeit

                        res = pd.DataFrame(
                        index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
                        columns=[10, 50, 100, 500, 1000, 5000, 10000],
                        dtype=float
                        )

                        for f in res.index:
                        for c in res.columns:
                        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                        df = pd.concat([df]*c)
                        stmt = '{}(df)'.format(f)
                        setp = 'from __main__ import df, {}'.format(f)
                        res.at[f, c] = timeit(stmt, setp, number=50)

                        ax = res.div(res.min()).T.plot(loglog=True)
                        ax.set_xlabel("N")
                        ax.set_ylabel("time (relative)")


                        Performance



                        enter image description here






                        share|improve this answer














                        Option 1



                        If all of the sublists in the other column are the same length, numpy can be an efficient option here:



                        vals = np.array(df.B.values.tolist())    
                        a = np.repeat(df.A, vals.shape[1])

                        pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)




                           A  B
                        0 1 1
                        1 1 2
                        2 2 1
                        3 2 2




                        Option 2



                        If the sublists have different length, you need an additional step:



                        vals = df.B.values.tolist()
                        rs = [len(r) for r in vals]
                        a = np.repeat(df.A, rs)

                        pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)




                           A  B
                        0 1 1
                        1 1 2
                        2 2 1
                        3 2 2




                        Option 3



                        I took a shot at generalizing this to work to flatten N columns and tile M columns, I'll work later on making it more efficient:



                        df = pd.DataFrame({'A': [1,2,3], 'B': [[1,2], [1,2,3], [1]],
                        'C': [[1,2,3], [1,2], [1,2]], 'D': ['A', 'B', 'C']})




                           A          B          C  D
                        0 1 [1, 2] [1, 2, 3] A
                        1 2 [1, 2, 3] [1, 2] B
                        2 3 [1] [1, 2] C




                        def unnest(df, tile, explode):
                        vals = df[explode].sum(1)
                        rs = [len(r) for r in vals]
                        a = np.repeat(df[tile].values, rs, axis=0)
                        b = np.concatenate(vals.values)
                        d = np.column_stack((a, b))
                        return pd.DataFrame(d, columns = tile + ['_'.join(explode)])

                        unnest(df, ['A', 'D'], ['B', 'C'])




                            A  D B_C
                        0 1 A 1
                        1 1 A 2
                        2 1 A 1
                        3 1 A 2
                        4 1 A 3
                        5 2 B 1
                        6 2 B 2
                        7 2 B 3
                        8 2 B 1
                        9 2 B 2
                        10 3 C 1
                        11 3 C 1
                        12 3 C 2




                        Functions



                        def wen1(df):
                        return df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0: 'B'})

                        def wen2(df):
                        return pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})

                        def wen3(df):
                        s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
                        return s.join(df.drop('B', 1), how='left')

                        def wen4(df):
                        return pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)

                        def chris1(df):
                        vals = np.array(df.B.values.tolist())
                        a = np.repeat(df.A, vals.shape[1])
                        return pd.DataFrame(np.column_stack((a, vals.ravel())), columns=df.columns)

                        def chris2(df):
                        vals = df.B.values.tolist()
                        rs = [len(r) for r in vals]
                        a = np.repeat(df.A.values, rs)
                        return pd.DataFrame(np.column_stack((a, np.concatenate(vals))), columns=df.columns)


                        Timings



                        import pandas as pd
                        import matplotlib.pyplot as plt
                        import numpy as np
                        from timeit import timeit

                        res = pd.DataFrame(
                        index=['wen1', 'wen2', 'wen3', 'wen4', 'chris1', 'chris2'],
                        columns=[10, 50, 100, 500, 1000, 5000, 10000],
                        dtype=float
                        )

                        for f in res.index:
                        for c in res.columns:
                        df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                        df = pd.concat([df]*c)
                        stmt = '{}(df)'.format(f)
                        setp = 'from __main__ import df, {}'.format(f)
                        res.at[f, c] = timeit(stmt, setp, number=50)

                        ax = res.div(res.min()).T.plot(loglog=True)
                        ax.set_xlabel("N")
                        ax.set_ylabel("time (relative)")


                        Performance



                        enter image description here







                        share|improve this answer














                        share|improve this answer



                        share|improve this answer








                        edited Nov 9 '18 at 4:15

























                        answered Nov 9 '18 at 2:35









                        user3483203

                        30.3k82354




                        30.3k82354























                            7














                            One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



                            import numpy as np
                            import pandas as pd


                            def unnest(frame, explode):
                            def mesh(values):
                            return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

                            data = np.vstack(mesh(row) for row in frame[explode].values)
                            return pd.DataFrame(data=data, columns=explode)


                            df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                            print(unnest(df, ['A', 'B'])) # base
                            print()

                            df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
                            print(unnest(df, ['A', 'B', 'C'])) # multiple columns
                            print()

                            df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                            'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

                            print(unnest(df, ['A', 'B'])) # uneven length lists
                            print()
                            print(unnest(df, ['D', 'B'])) # different types
                            print()


                            Output



                               A  B
                            0 1 1
                            1 1 2
                            2 2 1
                            3 2 2

                            A B C
                            0 1 1 1
                            1 1 2 1
                            2 1 1 2
                            3 1 2 2
                            4 2 3 3
                            5 2 4 3
                            6 2 3 4
                            7 2 4 4

                            A B
                            0 1 1
                            1 1 2
                            2 2 1
                            3 2 2
                            4 2 3
                            5 3 1

                            D B
                            0 A 1
                            1 A 2
                            2 B 1
                            3 B 2
                            4 B 3
                            5 C 1





                            share|improve this answer


























                              7














                              One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



                              import numpy as np
                              import pandas as pd


                              def unnest(frame, explode):
                              def mesh(values):
                              return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

                              data = np.vstack(mesh(row) for row in frame[explode].values)
                              return pd.DataFrame(data=data, columns=explode)


                              df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                              print(unnest(df, ['A', 'B'])) # base
                              print()

                              df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
                              print(unnest(df, ['A', 'B', 'C'])) # multiple columns
                              print()

                              df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                              'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

                              print(unnest(df, ['A', 'B'])) # uneven length lists
                              print()
                              print(unnest(df, ['D', 'B'])) # different types
                              print()


                              Output



                                 A  B
                              0 1 1
                              1 1 2
                              2 2 1
                              3 2 2

                              A B C
                              0 1 1 1
                              1 1 2 1
                              2 1 1 2
                              3 1 2 2
                              4 2 3 3
                              5 2 4 3
                              6 2 3 4
                              7 2 4 4

                              A B
                              0 1 1
                              1 1 2
                              2 2 1
                              3 2 2
                              4 2 3
                              5 3 1

                              D B
                              0 A 1
                              1 A 2
                              2 B 1
                              3 B 2
                              4 B 3
                              5 C 1





                              share|improve this answer
























                                7












                                7








                                7






                                One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



                                import numpy as np
                                import pandas as pd


                                def unnest(frame, explode):
                                def mesh(values):
                                return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

                                data = np.vstack(mesh(row) for row in frame[explode].values)
                                return pd.DataFrame(data=data, columns=explode)


                                df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                                print(unnest(df, ['A', 'B'])) # base
                                print()

                                df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
                                print(unnest(df, ['A', 'B', 'C'])) # multiple columns
                                print()

                                df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                                'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

                                print(unnest(df, ['A', 'B'])) # uneven length lists
                                print()
                                print(unnest(df, ['D', 'B'])) # different types
                                print()


                                Output



                                   A  B
                                0 1 1
                                1 1 2
                                2 2 1
                                3 2 2

                                A B C
                                0 1 1 1
                                1 1 2 1
                                2 1 1 2
                                3 1 2 2
                                4 2 3 3
                                5 2 4 3
                                6 2 3 4
                                7 2 4 4

                                A B
                                0 1 1
                                1 1 2
                                2 2 1
                                3 2 2
                                4 2 3
                                5 3 1

                                D B
                                0 A 1
                                1 A 2
                                2 B 1
                                3 B 2
                                4 B 3
                                5 C 1





                                share|improve this answer












                                One alternative is to apply the meshgrid recipe over the rows of the columns to unnest:



                                import numpy as np
                                import pandas as pd


                                def unnest(frame, explode):
                                def mesh(values):
                                return np.array(np.meshgrid(*values)).T.reshape(-1, len(values))

                                data = np.vstack(mesh(row) for row in frame[explode].values)
                                return pd.DataFrame(data=data, columns=explode)


                                df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
                                print(unnest(df, ['A', 'B'])) # base
                                print()

                                df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})
                                print(unnest(df, ['A', 'B', 'C'])) # multiple columns
                                print()

                                df = pd.DataFrame({'A': [1, 2, 3], 'B': [[1, 2], [1, 2, 3], [1]],
                                'C': [[1, 2, 3], [1, 2], [1, 2]], 'D': ['A', 'B', 'C']})

                                print(unnest(df, ['A', 'B'])) # uneven length lists
                                print()
                                print(unnest(df, ['D', 'B'])) # different types
                                print()


                                Output



                                   A  B
                                0 1 1
                                1 1 2
                                2 2 1
                                3 2 2

                                A B C
                                0 1 1 1
                                1 1 2 1
                                2 1 1 2
                                3 1 2 2
                                4 2 3 3
                                5 2 4 3
                                6 2 3 4
                                7 2 4 4

                                A B
                                0 1 1
                                1 1 2
                                2 2 1
                                3 2 2
                                4 2 3
                                5 3 1

                                D B
                                0 A 1
                                1 A 2
                                2 B 1
                                3 B 2
                                4 B 3
                                5 C 1






                                share|improve this answer












                                share|improve this answer



                                share|improve this answer










                                answered Dec 1 '18 at 1:31









                                Daniel Mesejo

                                14.3k11027




                                14.3k11027























                                    1














                                    My 5 cents:



                                    df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

                                    df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                                    ignore_index=True)


                                    and another 5



                                    df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

                                    (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
                                    .reset_index(level=1, drop=True)
                                    .reset_index())


                                    both resulting in the same



                                       A  B
                                    0 1 1
                                    1 2 1
                                    2 1 2
                                    3 2 2





                                    share|improve this answer




























                                      1














                                      My 5 cents:



                                      df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

                                      df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                                      ignore_index=True)


                                      and another 5



                                      df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

                                      (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
                                      .reset_index(level=1, drop=True)
                                      .reset_index())


                                      both resulting in the same



                                         A  B
                                      0 1 1
                                      1 2 1
                                      2 1 2
                                      3 2 2





                                      share|improve this answer


























                                        1












                                        1








                                        1






                                        My 5 cents:



                                        df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

                                        df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                                        ignore_index=True)


                                        and another 5



                                        df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

                                        (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
                                        .reset_index(level=1, drop=True)
                                        .reset_index())


                                        both resulting in the same



                                           A  B
                                        0 1 1
                                        1 2 1
                                        2 1 2
                                        3 2 2





                                        share|improve this answer














                                        My 5 cents:



                                        df[['B', 'B2']] = pd.DataFrame(df['B'].values.tolist())

                                        df[['A', 'B']].append(df[['A', 'B2']].rename(columns={'B2': 'B'}),
                                        ignore_index=True)


                                        and another 5



                                        df[['B1', 'B2']] = pd.DataFrame([*df['B']]) # if values.tolist() is too boring

                                        (pd.wide_to_long(df.drop('B', 1), 'B', 'A', '')
                                        .reset_index(level=1, drop=True)
                                        .reset_index())


                                        both resulting in the same



                                           A  B
                                        0 1 1
                                        1 2 1
                                        2 1 2
                                        3 2 2






                                        share|improve this answer














                                        share|improve this answer



                                        share|improve this answer








                                        edited Dec 11 '18 at 3:50

























                                        answered Dec 11 '18 at 2:05









                                        ayorgo

                                        1,110514




                                        1,110514























                                            -1














                                            Something pretty not recommended (at least work in this case):



                                            df=pd.concat([df]*2).sort_index()
                                            it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
                                            df['B']=df['B'].apply(lambda x:next(it))


                                            concat + sort_index + iter + apply + next.



                                            Now:



                                            print(df)


                                            Is:



                                               A  B
                                            0 1 1
                                            0 1 2
                                            1 2 1
                                            1 2 2


                                            If care about index:



                                            df=df.reset_index(drop=True)


                                            Now:



                                            print(df)


                                            Is:



                                               A  B
                                            0 1 1
                                            1 1 2
                                            2 2 1
                                            3 2 2





                                            share|improve this answer


























                                              -1














                                              Something pretty not recommended (at least work in this case):



                                              df=pd.concat([df]*2).sort_index()
                                              it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
                                              df['B']=df['B'].apply(lambda x:next(it))


                                              concat + sort_index + iter + apply + next.



                                              Now:



                                              print(df)


                                              Is:



                                                 A  B
                                              0 1 1
                                              0 1 2
                                              1 2 1
                                              1 2 2


                                              If care about index:



                                              df=df.reset_index(drop=True)


                                              Now:



                                              print(df)


                                              Is:



                                                 A  B
                                              0 1 1
                                              1 1 2
                                              2 2 1
                                              3 2 2





                                              share|improve this answer
























                                                -1












                                                -1








                                                -1






                                                Something pretty not recommended (at least work in this case):



                                                df=pd.concat([df]*2).sort_index()
                                                it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
                                                df['B']=df['B'].apply(lambda x:next(it))


                                                concat + sort_index + iter + apply + next.



                                                Now:



                                                print(df)


                                                Is:



                                                   A  B
                                                0 1 1
                                                0 1 2
                                                1 2 1
                                                1 2 2


                                                If care about index:



                                                df=df.reset_index(drop=True)


                                                Now:



                                                print(df)


                                                Is:



                                                   A  B
                                                0 1 1
                                                1 1 2
                                                2 2 1
                                                3 2 2





                                                share|improve this answer












                                                Something pretty not recommended (at least work in this case):



                                                df=pd.concat([df]*2).sort_index()
                                                it=iter(df['B'].tolist()[0]+df['B'].tolist()[0])
                                                df['B']=df['B'].apply(lambda x:next(it))


                                                concat + sort_index + iter + apply + next.



                                                Now:



                                                print(df)


                                                Is:



                                                   A  B
                                                0 1 1
                                                0 1 2
                                                1 2 1
                                                1 2 2


                                                If care about index:



                                                df=df.reset_index(drop=True)


                                                Now:



                                                print(df)


                                                Is:



                                                   A  B
                                                0 1 1
                                                1 1 2
                                                2 2 1
                                                3 2 2






                                                share|improve this answer












                                                share|improve this answer



                                                share|improve this answer










                                                answered Nov 9 '18 at 2:40









                                                U9-Forward

                                                13.3k21237




                                                13.3k21237






























                                                    draft saved

                                                    draft discarded




















































                                                    Thanks for contributing an answer to Stack Overflow!


                                                    • Please be sure to answer the question. Provide details and share your research!

                                                    But avoid



                                                    • Asking for help, clarification, or responding to other answers.

                                                    • Making statements based on opinion; back them up with references or personal experience.


                                                    To learn more, see our tips on writing great answers.





                                                    Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                                                    Please pay close attention to the following guidance:


                                                    • Please be sure to answer the question. Provide details and share your research!

                                                    But avoid



                                                    • Asking for help, clarification, or responding to other answers.

                                                    • Making statements based on opinion; back them up with references or personal experience.


                                                    To learn more, see our tips on writing great answers.




                                                    draft saved


                                                    draft discarded














                                                    StackExchange.ready(
                                                    function () {
                                                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53218931%2fhow-do-i-unnest-explode-a-column-in-a-pandas-dataframe%23new-answer', 'question_page');
                                                    }
                                                    );

                                                    Post as a guest















                                                    Required, but never shown





















































                                                    Required, but never shown














                                                    Required, but never shown












                                                    Required, but never shown







                                                    Required, but never shown

































                                                    Required, but never shown














                                                    Required, but never shown












                                                    Required, but never shown







                                                    Required, but never shown







                                                    Popular posts from this blog

                                                    404 Error Contact Form 7 ajax form submitting

                                                    How to know if a Active Directory user can login interactively

                                                    TypeError: fit_transform() missing 1 required positional argument: 'X'