Curious UDP throughput phenomenon
I've been experimenting with maximising UDP throughput from a relatively low-powered ARM SoC (a dual-core Cortex-A9) with a built-in 1G Ethernet MAC.
My experiments have taken various avenues, but the following is one result I'm unable to explain.
The code I'm using is here. It's a fairly simple little program (compiled with gcc -O3 udp_splurger.c -o udp_splurger) which sends a fixed number of UDP packets, reports how long that took, and calculates the total output data rate for that number of packets.
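The linked source isn't reproduced here, but in outline it does something like the following minimal sketch (the destination address, port and payload size are placeholders I've assumed, not values from the real code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define N_PACKETS  131072
#define PACKET_LEN 1400   /* assumed payload size, not taken from the real code */

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    /* Placeholder destination; the real program's target is not shown here. */
    struct sockaddr_in dest = {0};
    dest.sin_family = AF_INET;
    dest.sin_port = htons(12345);
    inet_pton(AF_INET, "192.168.1.2", &dest.sin_addr);

    char buf[PACKET_LEN];
    memset(buf, 0xA5, sizeof buf);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < N_PACKETS; i++)
        sendto(sock, buf, sizeof buf, 0, (struct sockaddr *)&dest, sizeof dest);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs  = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    double mbits = (double)N_PACKETS * PACKET_LEN * 8.0 / secs / 1e6;
    printf("Runtime to send %d packets: %f seconds (%f Mbits per second)\n",
           N_PACKETS, secs, mbits);

    close(sock);
    return 0;
}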
When I run the program alone, the first thing to note is that it matters which core the program runs on. I can explain this as the interrupts having a hard affinity to the first core, so when the program runs on that core, the interrupt handler competes with the program and the throughput goes down. So for the two cores, I see the following:
For CPU 0 (with 50% CPU, 1 core of a 2 core system):
$ sudo taskset 1 nice -10 ./udp_splurger 1
1: Writing simple data packets...
1: Runtime to send 131072 packets: 3.023376 seconds (483.817893 Mbits per second)
1: Runtime to send 131072 packets: 3.008770 seconds (486.166586 Mbits per second)
1: Runtime to send 131072 packets: 3.015237 seconds (485.123893 Mbits per second)
For CPU 1 (with 60% CPU, >1 core of a 2 core system):
$ sudo taskset 2 nice -10 ./udp_splurger 1
1: Writing simple data packets...
1: Runtime to send 131072 packets: 1.974865 seconds (740.690268 Mbits per second)
1: Runtime to send 131072 packets: 1.973994 seconds (741.017183 Mbits per second)
1: Runtime to send 131072 packets: 1.975528 seconds (740.441811 Mbits per second)
The curiosity is what happens when I then try to run two processes, so:
$ sudo taskset 2 nice -10 ./udp_splurger 1 & sudo taskset 1 nice -10 ./udp_splurger 2
[3] 1578
2: Writing simple data packets...
1: Writing simple data packets...
1: Runtime to send 131072 packets: 1.581942 seconds (924.662958 Mbits per second)
1: Runtime to send 131072 packets: 1.586901 seconds (921.773395 Mbits per second)
1: Runtime to send 131072 packets: 1.579631 seconds (926.016226 Mbits per second)
2: Runtime to send 131072 packets: 7.471531 seconds (195.778279 Mbits per second)
2: Runtime to send 131072 packets: 3.004867 seconds (486.798071 Mbits per second)
2: Runtime to send 131072 packets: 3.003318 seconds (487.049127 Mbits per second)
When both processes are running, the net CPU usage is something like 50% and 42% respectively.
On one level this sort of seems ok: we've thrown more total CPU at a CPU-bound problem and achieved greater aggregate throughput. However, I can't understand why one of the processes suddenly appears to be so much faster.
I initially wondered whether the second process was starving the interrupt handler of CPU time, so that interrupts coalesce and the total time spent in interrupt handling is reduced. But I would have thought a simple CPU stress program would have the same effect, and it doesn't (unless the stressor needs to be making system calls?).
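One way I could imagine testing that hypothesis is to snapshot the Ethernet IRQ counters in /proc/interrupts before and after each run and compare interrupts per packet; something like the following rough helper (the "eth" match string is an assumption about how this MAC's IRQ is named):

#include <stdio.h>
#include <string.h>

/* Print every /proc/interrupts line whose name mentions "eth" so that the
 * per-CPU interrupt counts can be compared before and after a run. */
static void dump_eth_irqs(void)
{
    FILE *f = fopen("/proc/interrupts", "r");
    char line[512];

    if (!f) {
        perror("/proc/interrupts");
        return;
    }
    while (fgets(line, sizeof line, f))
        if (strstr(line, "eth"))
            fputs(line, stdout);
    fclose(f);
}

int main(void)
{
    dump_eth_irqs();
    return 0;
}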
Any ideas what is going on?
Can I make this happen without the roundabout mechanism of running two processes?
I can confirm the correct number of packets are being received on the destination interface, so I'm confident those numbers are not just simply wrong.
Tags: c, linux, sockets, optimization, udp
Too many possibilities to know without knowing the exact hardware. For example, the hardware could be throttling CPU speed until load exceeds some threshold, or it could be a hardware quirk or even a driver "bug" specific to that SBC. Anyway, instead of using several processes, just use pthreads. Each thread is scheduled just like a process in Linux.
– Nominal Animal
Nov 23 '18 at 14:57
@NominalAnimal yeah, that's the plan - I just wanted to do a quick test with a pair of processes to see if it would increase throughput.
– Henry Gomersall
Nov 23 '18 at 15:46
You should get near-maximum UDP packet rate using two, three, or four threads, and each sending several UDP packets per syscall using sendmmsg(). At high packet rates the context switches (between userspace and kernel) become a limitation, so using sendmmsg() to send multiple messages per call will help. (The sender address will be the same, and depend on the socket you use for the call, but each packet can have a different recipient and/or contents.)
– Nominal Animal
Nov 23 '18 at 20:02
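For reference, a minimal sketch of the sendmmsg() batching described above might look like this (the batch size and payload length are arbitrary choices of mine, not tested values):

#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH      32     /* packets per sendmmsg() call; arbitrary */
#define PACKET_LEN 1400   /* arbitrary payload size */

/* Send one batch of identical UDP packets to dest with a single syscall.
 * Returns the number of packets actually sent, or -1 on error. */
int send_batch(int sock, const struct sockaddr_in *dest)
{
    static char payload[PACKET_LEN];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];

    memset(msgs, 0, sizeof msgs);
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = payload;
        iov[i].iov_len  = sizeof payload;
        msgs[i].msg_hdr.msg_name    = (void *)dest;
        msgs[i].msg_hdr.msg_namelen = sizeof *dest;
        msgs[i].msg_hdr.msg_iov     = &iov[i];
        msgs[i].msg_hdr.msg_iovlen  = 1;
    }
    return sendmmsg(sock, msgs, BATCH, 0);
}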
yeah, I've been exploring all of those things - including with AF_PACKET sockets and PACKET_MMAP. Each little thing gives an incremental improvement with a steady increase in complexity!
– Henry Gomersall
Nov 23 '18 at 20:31