Why is it faster to read a whole HDF5 dataset than a slice?
I'm trying to figure out why this happens:
In [1]: import time, h5py as h5
In [2]: f = h5.File('myfile.hdf5', 'r')
In [3]: st = time.time(); data = f["data"].value[0,:,1,...]; elapsed = time.time() - st;
In [4]: elapsed
Out[4]: 11.127676010131836
In [5]: st = time.time(); data = f["data"][0,:,1,...]; elapsed2 = time.time() - st;
In [6]: elapsed2
Out[6]: 59.810582399368286
In [7]: f["data"].shape
Out[7]: (1, 4096, 6, 16, 16, 16, 16)
In [8]: f["data"].chunks
Out[8]: (1, 4096, 1, 16, 16, 16, 16)
As you can see, loading the whole dataset into memory and then taking a slice is faster than taking that same slice from the dataset.
The chunk size matches the slice, so it should all be contiguous memory, right? Why then is it so much slower?
The dataset is compressed with gzip (compression_opts=2).
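For reference, a dataset with this shape, chunking and compression can be created along these lines (an illustrative sketch, not the exact code that produced myfile.hdf5; the dtype is an assumption):
import h5py as h5

# Illustrative sketch: same shape, chunk shape and gzip level as the dataset above.
# The dtype is an assumption; the real file may differ.
with h5.File('myfile_example.hdf5', 'w') as f:
    f.create_dataset(
        'data',
        shape=(1, 4096, 6, 16, 16, 16, 16),
        chunks=(1, 4096, 1, 16, 16, 16, 16),
        dtype='f4',
        compression='gzip',
        compression_opts=2,
    )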
Following Andrew's comment, I ran it again, clearing the caches between the two reads:
elapsed1: 11.001180410385132
elapsed2: 43.19723725318909
48.61user 4.45system 0:54.65elapsed 97%CPU (0avgtext+0avgdata 8431596maxresident)k
479584inputs+0outputs (106major+3764414minor)pagefaults 0swaps
(This next run had a 10-second delay between the two reads to clear the caches)
elapsed1: 11.46790862083435
elapsed2: 43.438515186309814
48.54user 4.66system 1:05.71elapsed 80%CPU (0avgtext+0avgdata 8431944maxresident)k
732504inputs+0outputs (220major+3764449minor)pagefaults 0swaps
python io hdf5 h5py
What's your OS? If it's on Linux, try running the different versions of your Python code under /usr/bin/time /your/python/here. That will show where the CPU time is spent - kernel/system or user-space - and provide some clue as to what's going on. Also on Linux, you can use strace to see what system calls are made in both versions. I suspect the "read it all into memory" approach unzips the data only once, while the slower method has to seek and unzip multiple times.
– Andrew Henle
Nov 26 '18 at 20:55
Also, if you're on Linux, flush your page cache before running each of your different versions.
– Andrew Henle
Nov 26 '18 at 20:58
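A minimal sketch of that setup in Python, assuming Linux and root privileges (dropping the page cache means writing to /proc/sys/vm/drop_caches):
import os, time
import h5py as h5

def drop_caches():
    # Flush dirty pages to disk, then drop clean page/dentry/inode caches.
    # Linux-specific and requires root.
    os.sync()
    with open('/proc/sys/vm/drop_caches', 'w') as fh:
        fh.write('3\n')

def timed_read(path, read):
    drop_caches()
    with h5.File(path, 'r') as f:
        st = time.time()
        read(f['data'])
        return time.time() - st

# whole dataset into memory, then slice with NumPy
t1 = timed_read('myfile.hdf5', lambda d: d[()][0, :, 1, ...])
# same slice taken through HDF5
t2 = timed_read('myfile.hdf5', lambda d: d[0, :, 1, ...])
print(t1, t2)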
You are using fancy slicing. stackoverflow.com/a/48405220/4045774 - I mentioned this case under "simplest form of fancy indexing". Your chunks are also extremely large; that isn't really necessary if you use a proper chunk-cache size.
– max9111
Nov 27 '18 at 9:01
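If your h5py accepts the rdcc_* keywords on h5py.File (added in h5py 2.9), a sketch of opening the file with a chunk cache large enough for one of these chunks; the exact numbers below are assumptions to tune:
import h5py as h5

# One chunk holds 1*4096*1*16*16*16*16 = ~268M elements, roughly 1 GiB if the
# elements are 4 bytes wide (an assumption).  The default chunk cache is only
# 1 MiB, far smaller than one chunk, so chunks cannot stay cached between reads.
f = h5.File('myfile.hdf5', 'r',
            rdcc_nbytes=2 * 1024**3,  # assumed: room for ~2 chunks
            rdcc_nslots=1009,         # hash slots; a prime well above the chunk count
            rdcc_w0=1.0)              # evict fully-read chunks first
data = f['data'][0, :, 1, ...]
f.close()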
There are also much faster compression algorithms available than gzip. If portability between multiple languages isn't an issue (gzip comes with every HDF5 installation, unlike the much faster blosc), consider blosc. If your data offers a high compression ratio and is stored on an HDD, it is often faster than an uncompressed dataset, e.g. 900 MB/s from an HDD -> stackoverflow.com/a/48997927/4045774
– max9111
Nov 27 '18 at 9:08
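A sketch of rewriting the dataset with Blosc, assuming the third-party hdf5plugin package (pip install hdf5plugin) is available; readers in other languages would then need the Blosc filter installed as well:
import h5py as h5
import hdf5plugin  # registers the Blosc filter with HDF5 when imported

with h5.File('myfile.hdf5', 'r') as src, h5.File('myfile_blosc.hdf5', 'w') as dst:
    d = src['data']
    dst.create_dataset(
        'data',
        data=d[()],            # read once into memory, then rewrite compressed
        chunks=d.chunks,
        **hdf5plugin.Blosc(cname='lz4', clevel=5,
                           shuffle=hdf5plugin.Blosc.SHUFFLE),
    )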
I wonder how data[0][:,1] does.
– hpaulj
Nov 30 '18 at 16:56
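For completeness, that variant can be timed the same way (the first index reads everything along axis 0 through h5py; the second index is then plain NumPy on the in-memory array):
import time, h5py as h5

with h5.File('myfile.hdf5', 'r') as f:
    st = time.time()
    data = f['data'][0][:, 1]   # h5py read of the (only) axis-0 element, then a NumPy slice
    elapsed3 = time.time() - st
print(elapsed3, data.shape)     # shape should match the (4096, 16, 16, 16, 16) slice above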
1 Answer
First I ran a test of my own. I don't have your HDF5 file, so I used one of my test files. My test Table dataset has ~54,000 rows (which seems larger than yours).
Timing result with .value gives:
>>> elapsed
0.15540122985839844
Timing result with NumPy indexing gives:
>>> elapsed2
0.12980079650878906
So, I don't see much difference in performance. Maybe it's related to the dataset sizes we are testing, or the complexity of the data tables?
A little reading of the most recent h5py documentation has some interesting comments about Dataset.value (from Release 2.8.0 - Jun 05, 2018; emphasis mine):
Dataset.value property is now deprecated.
The property Dataset.value, which dates back to h5py 1.0, is deprecated and will be removed in a later release. This property dumps the entire dataset into a NumPy array. Code using .value should be updated to use NumPy indexing, using mydataset[...] or mydataset[()] as appropriate.
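In code, the recommended replacement looks like this (a sketch using the dataset from the question):
# deprecated: materialises the whole dataset, then slices in memory
data = f['data'].value[0, :, 1, ...]

# equivalent full read via NumPy indexing
data = f['data'][()][0, :, 1, ...]

# or let HDF5 read only the requested slice
data = f['data'][0, :, 1, ...]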
Your timing tests seem to run counter to the highlighted observation above.
I think you need to ask an h5py developer to comment on the performance differences (and where the data is stored - in memory vs on disk). Have you checked with the h5py user group?
Edit:
After posting, I found this SO Q&A. It has lots of good comments and includes responses from the h5py developer:
h5py: Correct way to slice array datasets
OP's matrix has over 1 billion elements.
– dshin
Nov 26 '18 at 19:15