Curious UDP throughput phenomenon

I've been experimenting with maximising UDP throughput from a relatively low-powered ARM SoC (a dual-core Cortex-A9) with a built-in 1G Ethernet MAC.



My experiments have taken various avenues, but the following is one result I'm unable to explain.



The code I'm using is here. It's a fairly simple little program (compiled with gcc -O3 udp_splurger.c -o udp_splurger) which sends a fixed number of UDP packets, reports how long that takes, and calculates the total output data rate for that number of packets.
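
For reference, this is roughly the shape of the send loop (a minimal sketch only; the destination address, port number, 1400-byte payload and three timed rounds are illustrative assumptions, not necessarily what udp_splurger actually does):

/* Hypothetical sketch of the benchmark; the real udp_splurger may differ.
 * Destination, port, payload size and packet count are assumptions. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define N_PACKETS 131072
#define PAYLOAD   1400              /* bytes per datagram (assumed) */

int main(int argc, char *argv[])
{
    const char *id = (argc > 1) ? argv[1] : "1";
    static char buf[PAYLOAD];       /* zero-filled payload */

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(12345);                        /* assumed port */
    inet_pton(AF_INET, "192.168.1.2", &dst.sin_addr);   /* assumed destination */

    /* connect() the socket so the loop can use plain send() */
    if (connect(fd, (struct sockaddr *)&dst, sizeof dst) < 0) {
        perror("connect"); return 1;
    }

    printf("%s: Writing simple data packets...\n", id);
    for (int round = 0; round < 3; round++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < N_PACKETS; i++)
            if (send(fd, buf, sizeof buf, 0) < 0) { perror("send"); return 1; }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double mbits = (double)N_PACKETS * PAYLOAD * 8.0 / secs / 1e6;
        printf("%s: Runtime to send %d packets: %f seconds (%f Mbits per second)\n",
               id, N_PACKETS, secs, mbits);
    }
    close(fd);
    return 0;
}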



When I run the program alone, the first thing to note is that it matters which core the program runs on. I can explain this by the Ethernet interrupts having a hard affinity to the first core: when the program runs on that core, the interrupt handler competes with the program for CPU time and the throughput goes down. So, pinning the program to each of the two cores in turn, I see the following:



For CPU 0 (about 50% total CPU used, i.e. one core of the two-core system):



$ sudo taskset 1 nice -10 ./udp_splurger 1
1: Writing simple data packets...
1: Runtime to send 131072 packets: 3.023376 seconds (483.817893 Mbits per second)
1: Runtime to send 131072 packets: 3.008770 seconds (486.166586 Mbits per second)
1: Runtime to send 131072 packets: 3.015237 seconds (485.123893 Mbits per second)


For CPU 1 (about 60% total CPU used, i.e. more than one core of the two-core system):



$ sudo taskset 2 nice -10 ./udp_splurger 1  
1: Writing simple data packets...
1: Runtime to send 131072 packets: 1.974865 seconds (740.690268 Mbits per second)
1: Runtime to send 131072 packets: 1.973994 seconds (741.017183 Mbits per second)
1: Runtime to send 131072 packets: 1.975528 seconds (740.441811 Mbits per second)


The curiosity is what happens when I then run two processes at once:



$ sudo taskset 2 nice -10 ./udp_splurger 1 & sudo taskset 1 nice -10 ./udp_splurger 2
[3] 1578
2: Writing simple data packets...
1: Writing simple data packets...
1: Runtime to send 131072 packets: 1.581942 seconds (924.662958 Mbits per second)
1: Runtime to send 131072 packets: 1.586901 seconds (921.773395 Mbits per second)
1: Runtime to send 131072 packets: 1.579631 seconds (926.016226 Mbits per second)
2: Runtime to send 131072 packets: 7.471531 seconds (195.778279 Mbits per second)
2: Runtime to send 131072 packets: 3.004867 seconds (486.798071 Mbits per second)
2: Runtime to send 131072 packets: 3.003318 seconds (487.049127 Mbits per second)


When both processes are running, their CPU usage is something like 50% and 42% respectively.



On one level this sort of seems OK: we've thrown more total CPU at a CPU-bound problem and achieved greater throughput. However, I can't understand why one of the processes suddenly appears to be so much faster.



I initially wondered whether the second process is starving the interrupt handler of CPU time, so that interrupts coalesce and the total time spent in interrupt handling is reduced. But then I would have expected a simple CPU stress program to have the same effect, and it doesn't (unless some system calls are needed?).
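
(To be concrete about what I mean by a stress program: something like the following busy loop, pinned to core 0 alongside the sender on core 1. The binary name and the pinning command are just illustrative.)

/* Hypothetical busy-loop stressor: burns CPU on core 0 without making any
 * system calls, e.g. run as: sudo taskset 1 nice -10 ./busy_spin
 * If the speed-up depends on syscall activity on core 0, this should NOT
 * reproduce it. */
#include <stdint.h>

int main(void)
{
    volatile uint64_t counter = 0;
    for (;;)
        counter++;      /* pure userspace spin, no syscalls */
    return 0;
}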



Any ideas what is going on?



Can I make this happen without the roundabout mechanism of running two processes?



I can confirm that the correct number of packets is being received on the destination interface, so I'm confident those numbers aren't simply wrong.










Comments:

  • Too many possibilities to know without knowing the exact hardware. For example, the hardware could be throttling CPU speed until load exceeds some threshold, or it could be a hardware quirk or even a driver "bug" specific to that SBC. Anyway, instead of using several processes, just use pthreads. Each thread is scheduled just like a process in Linux. – Nominal Animal, Nov 23 '18 at 14:57

  • @NominalAnimal yeah, that's the plan - I just wanted to do a quick test with a pair of processes to see if it would increase throughput. – Henry Gomersall, Nov 23 '18 at 15:46

  • You should get near-maximum UDP packet rate using two, three, or four threads, and each sending several UDP packets per syscall using sendmmsg(). At high packet rates the context switches (between userspace and kernel) become a limitation, so using sendmmsg() to send multiple messages per call will help. (The sender address will be the same, and depend on the socket you use for the call, but each packet can have a different recipient and/or contents.) [See the sendmmsg() sketch after these comments.] – Nominal Animal, Nov 23 '18 at 20:02

  • yeah, I've been exploring all of those things - including with AF_PACKET sockets and PACKET_MMAP. Each little thing gives an incremental improvement with a steady increase in complexity! – Henry Gomersall, Nov 23 '18 at 20:31
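
For reference, here is a minimal sketch of the sendmmsg() batching suggested in the comments. The batch size, destination address, port and payload size are illustrative assumptions:

/* Sketch of batching UDP sends with sendmmsg() to amortise syscall cost.
 * All constants below are illustrative. */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH   32          /* datagrams per sendmmsg() call (assumed) */
#define PAYLOAD 1400        /* bytes per datagram (assumed) */

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(12345);                        /* assumed port */
    inet_pton(AF_INET, "192.168.1.2", &dst.sin_addr);   /* assumed destination */
    if (connect(fd, (struct sockaddr *)&dst, sizeof dst) < 0) {
        perror("connect"); return 1;
    }

    static char buf[BATCH][PAYLOAD];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof msgs);
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = buf[i];
        iov[i].iov_len  = PAYLOAD;
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
        /* msg_name stays NULL because the socket is connect()ed; set it
         * per message to address different recipients. */
    }

    for (;;) {
        int sent = sendmmsg(fd, msgs, BATCH, 0);
        if (sent < 0) { perror("sendmmsg"); return 1; }
        /* 'sent' datagrams went out in one user/kernel transition instead
         * of BATCH separate send() calls. */
    }
    return 0;
}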
















Tags: c, linux, sockets, optimization, udp
asked Nov 23 '18 at 13:59 by Henry Gomersall (edited Nov 23 '18 at 14:15)







