Parallel excel sheet read from dask
Hello. All the examples I have come across for dask so far involve
multiple CSV files in a folder being read with dask's read_csv
call.
If I am given an xlsx file with multiple tabs, can I use anything
in dask to read them in parallel?
P.S. I am using pandas 0.19.2 with Python 2.7.
python-2.7 dask
asked Jun 20 '17 at 13:47 by schuler
You would be best to write a function to read one tab (taking the tab ID as input), and look into dask's delayed function. Do you want to process all the tabs as a single data-frame?
– mdurant Jun 20 '17 at 14:50
This notebook may be of interest: gist.github.com/mrocklin/e7b7b3a65f2835cda813096332ec73ca
– MRocklin Jun 21 '17 at 4:50
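The suggestion in the first comment (one function per tab, wrapped with dask's delayed) could look roughly like the sketch below for the case where the tabs should stay separate rather than become one data-frame. The names read_tab, read_tabs and the reader parameter are hypothetical and only there for illustration; the sheet is passed positionally so the call should work whether the installed pandas spells the keyword sheetname (before 0.21) or sheet_name.

```python
import dask
import pandas as pd


def read_tab(path, sheet, **options):
    """Read one worksheet eagerly into a pandas DataFrame."""
    # The sheet is passed positionally so the call does not depend on the
    # keyword name (sheetname before pandas 0.21, sheet_name afterwards).
    return pd.read_excel(path, sheet, **options)


def read_tabs(path, sheets, reader=read_tab, **options):
    """Return {sheet: DataFrame}, building one delayed task per sheet.

    `reader` is a hypothetical hook (not a dask or pandas feature) that lets
    the per-sheet function be swapped out, for example in tests.
    """
    tasks = [dask.delayed(reader)(path, sheet, **options) for sheet in sheets]
    frames = dask.compute(*tasks)  # runs the tasks on dask's default scheduler
    return dict(zip(sheets, frames))
```

A call such as read_tabs('my_file.xlsx', ['Sheet1', 'Sheet2']) would then give one pandas DataFrame per tab; the sheet names here are placeholders.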
2 Answers
A simple example:
import dask
import dask.dataframe as dd
import pandas as pd
fn = 'my_file.xlsx'
parts = [dask.delayed(pd.read_excel)(fn, i, **other_options) for i in range(number_of_sheets)]  # one lazy task per sheet
df = dd.from_delayed(parts, meta=parts[0].compute())  # the first sheet is read eagerly to infer the columns
This assumes you supply the "other options" needed to extract the data (which must be uniform across sheets) and that you want to build a single master data-frame out of the set.
Note that I don't know the internals of the Excel reader, so how parallel the reading/parsing step would be is uncertain, but subsequent computations once the data are in memory definitely would be.
answered Jun 21 '17 at 14:55 by mdurant
For those using Python 3.6:
# reading the file using dask
import dask
import dask.dataframe as dd
import pandas as pd
excel_file = 'my_file.xlsx'  # placeholder path
# sheet_name=0 reads only the first sheet; the sheet_name keyword needs pandas 0.21 or later
parts = [dask.delayed(pd.read_excel)(excel_file, sheet_name=0, usecols=[1, 2, 7])]
df = dd.from_delayed(parts)
print(df.head())
I'm seeing a 50% speed increase on load on an i7, 16 GB, 5th-gen machine.
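Any speed-up will depend on the workbook and on which dask scheduler runs the reads, so it is worth measuring on your own file. The helper below is a rough sketch for doing that; time_load and its reader hook are hypothetical names, and the scheduler strings are the ones dask accepts ("synchronous", "threads", "processes").

```python
import time

import dask
import pandas as pd


def time_load(path, sheets, scheduler="threads", reader=pd.read_excel, **options):
    """Load the given sheets with one dask scheduler and time the load.

    Returns (elapsed_seconds, frames), where frames is a tuple of pandas
    DataFrames in the same order as `sheets`.
    """
    tasks = [dask.delayed(reader)(path, sheet, **options) for sheet in sheets]
    start = time.perf_counter()
    frames = dask.compute(*tasks, scheduler=scheduler)
    elapsed = time.perf_counter() - start
    return elapsed, frames
```

Running it once with scheduler="synchronous" gives a serial baseline to compare against "threads" and "processes". How much threads help depends on how much of the Excel parsing releases the GIL, which is the uncertainty the other answer mentions.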
answered Nov 23 '18 at 11:25 by zorze