Parallel excel sheet read from dask












Hello. All the examples I have come across for using dask so far involve
multiple CSV files in a folder being read with dask's read_csv call.

If I am given an xlsx file with multiple tabs, can I use anything
in dask to read them in parallel?

P.S. I am using pandas 0.19.2 with Python 2.7.







python-2.7 dask






asked Jun 20 '17 at 13:47
schuler








  • You would be best to write a function to read one tab (taking the tab ID as input), and look into dask's delayed function. Are you wanting to process all the tabs as a single data-frame?
    – mdurant, Jun 20 '17 at 14:50

  • This notebook may be of interest: gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca
    – MRocklin, Jun 21 '17 at 4:50
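
The suggestion in the first comment (one function per tab, wrapped with dask's delayed) might look roughly like the sketch below. The names `read_one_tab`, `read_tabs_lazily` and `load_tabs` are hypothetical, and the `reader` parameter is an assumption added only so the sketch can be tried without a real workbook. It keeps each tab as its own pandas DataFrame rather than combining them.

```python
import dask
import pandas as pd


def read_one_tab(path, tab, reader=pd.read_excel, **options):
    """Read a single tab (by index or name) into a pandas DataFrame."""
    # pandas >= 0.21 spells the keyword `sheet_name`; older releases
    # (such as 0.19.2 from the question) used `sheetname` instead.
    return reader(path, sheet_name=tab, **options)


def read_tabs_lazily(path, tabs, reader=pd.read_excel, **options):
    """Return {tab: Delayed}; no tab is read until the tasks are computed."""
    return {
        tab: dask.delayed(read_one_tab)(path, tab, reader=reader, **options)
        for tab in tabs
    }


def load_tabs(path, tabs, reader=pd.read_excel, **options):
    """Compute all tab reads in one dask.compute call; returns {tab: DataFrame}."""
    lazy = read_tabs_lazily(path, tabs, reader=reader, **options)
    # dask.compute returns a tuple with one result per argument.
    (frames,) = dask.compute(lazy)
    return frames
```

Calling `load_tabs('my_file.xlsx', ['Sheet1', 'Sheet2'])` would hand all the reads to the scheduler at once. Whether they actually overlap depends on the Excel engine and the scheduler in use.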














2 Answers
A simple example:

import dask
import dask.dataframe as dd
import pandas as pd
fn = 'my_file.xlsx'
parts = [dask.delayed(pd.read_excel)(fn, i, **other_options) for i in range(number_of_sheets)]
df = dd.from_delayed(parts, meta=parts[0].compute())

Here other_options and number_of_sheets are placeholders you define yourself. This assumes you provide the "other options" needed to extract the data (which must be uniform across sheets) and that you want to make a single master data-frame out of the set.

Note that I don't know the internals of the excel reader, so how parallel the reading/parsing part would be is uncertain, but subsequent computations once the data are in memory would definitely be parallel.







answered Jun 21 '17 at 14:55
mdurant

























For those using Python 3.6:

# reading the file using dask
import dask
import dask.dataframe as dd
import pandas as pd

# excel_file is the path to your workbook
parts = dask.delayed(pd.read_excel)(excel_file, sheet_name=0, usecols=[1, 2, 7])
df = dd.from_delayed([parts])

print(df.head())

I'm seeing a 50% speed increase on load on an i7, 16GB, 5th-gen machine.







answered Nov 23 '18 at 11:25
zorze





























