How to optimise the re-ordering of an array?












3














I want to optimize the re-ordering of some data arrays that hold about 4 million unsigned shorts. The aim is to process a data stream by bringing values that should be similar to each other to be close to each other. The pseudo-code is like this:



  for( i=0; i<n; i++)
dest[i] = src[ idx[i] ] ;


To optimize the code for a specific list of idx[i] I tried to compile a 4 million line c function with the idx values filled in:



void reorder( unsigned short * restrict i, unsigned short * restrict o) {
o[0]=i[2075723];
o[1]=i[2075724];
o[2]=i[2075722];
...
o[4194301]=i[4192257];
o[4194302]=i[4192256];
o[4194303]=i[4190208];
}


I had hoped to get GCC to create a clever stream of pshufw/pblend/unpack instructions ... instead it hangs after using up a lot of memory (7 GB). I was trying to make copy based version to avoid the complications of doing swaps in place.



Would anyone be able to suggest good ways to produce optimized the code for this problem? So far I tried:




  • ordered reading, random writes : 60 ms (openmp didn't help)

  • ordered writing, random reads : 20 ms (openmp -> 4 ms)


I was hoping to end up getting closer to the memory bandwidth (order 0.4 ms). A scheme that takes into account cache size and does blocking should help but I don't know where to start for designing one to do it. I also wonder if there is a simple way to exploit SIMD instructions?



Making a toy example with transpose I couldn't even get gcc to output an SIMD version, see:



https://godbolt.org/z/bzGWad



Is this a difficult problem for compilers or am I missing something simple?



Edit 21/11/2018 Added a complete but minimal example of the problem



Here is a complete example of the problem I am trying to optimise. In reality the ordering is a more complicated function, but the point is just to order data pixels according to their distance from the image center, like unwinding a spiral.



#include <omp.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <algorithm>

#define N 2048

// Sorting on output, one core
void reorder_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
}
// Sorting on output write, many cores
void reorder_omp( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
}
// Benchmark for memory throughput, one core
void copy_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
for( int i=0; i<N*N; i++)
output[i] = input[i];
}
// Benchmark for memory throughput, many cores
void copy_omp ( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[i];
}

// Macro to avoid retyping
#define bench(func)
func( indices, input, output);
start = omp_get_wtime();
for( size_t i=0; i<100; i++)
func( indices, input, output );
end = omp_get_wtime();
std:: cout << std::setw(15) << #func <<
", Time taken: " << (end-start)/100 << " /sn";

int main()
{
std::vector<float> sort_order(N*N);
std::vector<size_t> indices(N*N);
float radius, azimuth, ci, cj;
double start, end;
unsigned short *input, *output;

ci = N*0.496; // changes according to calibration
cj = N*0.4985; // reality is more complicated (tilts etc)
for( size_t i=0; i<N; i++){
for( size_t j=0; j<N; j++){
radius = sqrt( (i-ci)*(i-ci) + (j-cj)*(j-cj) );
azimuth = atan2( i-ci, j-cj ); // from -pi to pi
sort_order[i*N+j] = round( radius ) + azimuth/2/M_PI;
indices[i*N+j] = i*N+j;
}
}
// Find the order to sort data onto a spiral
std::sort( indices.begin(), indices.end(),
[&sort_order](int i, int j){
return sort_order[i] < sort_order[j]; });
// Invent some test data
input = new unsigned short [N*N];
output = new unsigned short [N*N];
for( size_t i=0 ; i<N*N; i++){
input[i] = i;
output[i]= 0;
}
// some timing:
bench(reorder_simple);
bench(reorder_omp) ;
bench(copy_simple) ;
bench(copy_omp) ;
}


% g++ reorder.cpp -o reorder -std=c++11 -O3 -march=native -fopenmp -Wall
% ./reorder
reorder_simple, Time taken: 0.0179023 /s
reorder_omp, Time taken: 0.00349932 /s
copy_simple, Time taken: 0.00140805 /s
copy_omp, Time taken: 0.000250205 /s


I would like to get the reorder_omp function to be closer to the speed of the copy_omp function. Detectors can run at 500 frames per second so 3.5 ms is bad in comparison to 0.25 ms.



Edit again: 21/11/2018 the code to write the function that does not compile



  //top of file
#include <fstream>
...
//just before the end:
std::ofstream out;
out.open("cfunc.c");
out << "void cfunc( unsigned short * restrict input,n" <<
" unsigned short * restrict output){ n";
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
out << "output[" << i*N+j << "] = input[" << indices[i*N+j] << "];n";
out << "}n";
out.close();


Testing this on a different machine I am getting compiler errors from both gcc (7.3.0) and clang (6.0.0). It compiles and runs with tcc (0.9.27) but finishes slower than the looping over the indices.










share|improve this question




















  • 3




    So idx is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
    – Peter Cordes
    Nov 15 '18 at 14:36








  • 3




    What is the pattern? The two triplets shown have a output + (0, 1, 2) = input + (1, 2, 0) and output + (0, 1, 2) = input + (1, 0, 2) pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
    – Jonathan Leffler
    Nov 15 '18 at 14:47






  • 3




    I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you use short *__restrict dst to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
    – Peter Cordes
    Nov 15 '18 at 14:47








  • 2




    BTW: are your unsigned shorts wider than 16 bits?
    – joop
    Nov 15 '18 at 15:02










  • idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
    – Jon
    Nov 15 '18 at 16:55


















3














I want to optimize the re-ordering of some data arrays that hold about 4 million unsigned shorts. The aim is to process a data stream by bringing values that should be similar to each other to be close to each other. The pseudo-code is like this:



  for( i=0; i<n; i++)
dest[i] = src[ idx[i] ] ;


To optimize the code for a specific list of idx[i] I tried to compile a 4 million line c function with the idx values filled in:



void reorder( unsigned short * restrict i, unsigned short * restrict o) {
o[0]=i[2075723];
o[1]=i[2075724];
o[2]=i[2075722];
...
o[4194301]=i[4192257];
o[4194302]=i[4192256];
o[4194303]=i[4190208];
}


I had hoped to get GCC to create a clever stream of pshufw/pblend/unpack instructions ... instead it hangs after using up a lot of memory (7 GB). I was trying to make copy based version to avoid the complications of doing swaps in place.



Would anyone be able to suggest good ways to produce optimized the code for this problem? So far I tried:




  • ordered reading, random writes : 60 ms (openmp didn't help)

  • ordered writing, random reads : 20 ms (openmp -> 4 ms)


I was hoping to end up getting closer to the memory bandwidth (order 0.4 ms). A scheme that takes into account cache size and does blocking should help but I don't know where to start for designing one to do it. I also wonder if there is a simple way to exploit SIMD instructions?



Making a toy example with transpose I couldn't even get gcc to output an SIMD version, see:



https://godbolt.org/z/bzGWad



Is this a difficult problem for compilers or am I missing something simple?



Edit 21/11/2018 Added a complete but minimal example of the problem



Here is a complete example of the problem I am trying to optimise. In reality the ordering is a more complicated function, but the point is just to order data pixels according to their distance from the image center, like unwinding a spiral.



#include <omp.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <algorithm>

#define N 2048

// Sorting on output, one core
void reorder_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
}
// Sorting on output write, many cores
void reorder_omp( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
}
// Benchmark for memory throughput, one core
void copy_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
for( int i=0; i<N*N; i++)
output[i] = input[i];
}
// Benchmark for memory throughput, many cores
void copy_omp ( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[i];
}

// Macro to avoid retyping
#define bench(func)
func( indices, input, output);
start = omp_get_wtime();
for( size_t i=0; i<100; i++)
func( indices, input, output );
end = omp_get_wtime();
std:: cout << std::setw(15) << #func <<
", Time taken: " << (end-start)/100 << " /sn";

int main()
{
std::vector<float> sort_order(N*N);
std::vector<size_t> indices(N*N);
float radius, azimuth, ci, cj;
double start, end;
unsigned short *input, *output;

ci = N*0.496; // changes according to calibration
cj = N*0.4985; // reality is more complicated (tilts etc)
for( size_t i=0; i<N; i++){
for( size_t j=0; j<N; j++){
radius = sqrt( (i-ci)*(i-ci) + (j-cj)*(j-cj) );
azimuth = atan2( i-ci, j-cj ); // from -pi to pi
sort_order[i*N+j] = round( radius ) + azimuth/2/M_PI;
indices[i*N+j] = i*N+j;
}
}
// Find the order to sort data onto a spiral
std::sort( indices.begin(), indices.end(),
[&sort_order](int i, int j){
return sort_order[i] < sort_order[j]; });
// Invent some test data
input = new unsigned short [N*N];
output = new unsigned short [N*N];
for( size_t i=0 ; i<N*N; i++){
input[i] = i;
output[i]= 0;
}
// some timing:
bench(reorder_simple);
bench(reorder_omp) ;
bench(copy_simple) ;
bench(copy_omp) ;
}


% g++ reorder.cpp -o reorder -std=c++11 -O3 -march=native -fopenmp -Wall
% ./reorder
reorder_simple, Time taken: 0.0179023 /s
reorder_omp, Time taken: 0.00349932 /s
copy_simple, Time taken: 0.00140805 /s
copy_omp, Time taken: 0.000250205 /s


I would like to get the reorder_omp function to be closer to the speed of the copy_omp function. Detectors can run at 500 frames per second so 3.5 ms is bad in comparison to 0.25 ms.



Edit again: 21/11/2018 the code to write the function that does not compile



  //top of file
#include <fstream>
...
//just before the end:
std::ofstream out;
out.open("cfunc.c");
out << "void cfunc( unsigned short * restrict input,n" <<
" unsigned short * restrict output){ n";
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
out << "output[" << i*N+j << "] = input[" << indices[i*N+j] << "];n";
out << "}n";
out.close();


Testing this on a different machine I am getting compiler errors from both gcc (7.3.0) and clang (6.0.0). It compiles and runs with tcc (0.9.27) but finishes slower than the looping over the indices.










share|improve this question




















  • 3




    So idx is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
    – Peter Cordes
    Nov 15 '18 at 14:36








  • 3




    What is the pattern? The two triplets shown have a output + (0, 1, 2) = input + (1, 2, 0) and output + (0, 1, 2) = input + (1, 0, 2) pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
    – Jonathan Leffler
    Nov 15 '18 at 14:47






  • 3




    I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you use short *__restrict dst to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
    – Peter Cordes
    Nov 15 '18 at 14:47








  • 2




    BTW: are your unsigned shorts wider than 16 bits?
    – joop
    Nov 15 '18 at 15:02










  • idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
    – Jon
    Nov 15 '18 at 16:55
















3












3








3


1





I want to optimize the re-ordering of some data arrays that hold about 4 million unsigned shorts. The aim is to process a data stream by bringing values that should be similar to each other to be close to each other. The pseudo-code is like this:



  for( i=0; i<n; i++)
dest[i] = src[ idx[i] ] ;


To optimize the code for a specific list of idx[i] I tried to compile a 4 million line c function with the idx values filled in:



void reorder( unsigned short * restrict i, unsigned short * restrict o) {
o[0]=i[2075723];
o[1]=i[2075724];
o[2]=i[2075722];
...
o[4194301]=i[4192257];
o[4194302]=i[4192256];
o[4194303]=i[4190208];
}


I had hoped to get GCC to create a clever stream of pshufw/pblend/unpack instructions ... instead it hangs after using up a lot of memory (7 GB). I was trying to make copy based version to avoid the complications of doing swaps in place.



Would anyone be able to suggest good ways to produce optimized the code for this problem? So far I tried:




  • ordered reading, random writes : 60 ms (openmp didn't help)

  • ordered writing, random reads : 20 ms (openmp -> 4 ms)


I was hoping to end up getting closer to the memory bandwidth (order 0.4 ms). A scheme that takes into account cache size and does blocking should help but I don't know where to start for designing one to do it. I also wonder if there is a simple way to exploit SIMD instructions?



Making a toy example with transpose I couldn't even get gcc to output an SIMD version, see:



https://godbolt.org/z/bzGWad



Is this a difficult problem for compilers or am I missing something simple?



Edit 21/11/2018 Added a complete but minimal example of the problem



Here is a complete example of the problem I am trying to optimise. In reality the ordering is a more complicated function, but the point is just to order data pixels according to their distance from the image center, like unwinding a spiral.



#include <omp.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <algorithm>

#define N 2048

// Sorting on output, one core
void reorder_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
}
// Sorting on output write, many cores
void reorder_omp( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
}
// Benchmark for memory throughput, one core
void copy_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
for( int i=0; i<N*N; i++)
output[i] = input[i];
}
// Benchmark for memory throughput, many cores
void copy_omp ( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[i];
}

// Macro to avoid retyping
#define bench(func)
func( indices, input, output);
start = omp_get_wtime();
for( size_t i=0; i<100; i++)
func( indices, input, output );
end = omp_get_wtime();
std:: cout << std::setw(15) << #func <<
", Time taken: " << (end-start)/100 << " /sn";

int main()
{
std::vector<float> sort_order(N*N);
std::vector<size_t> indices(N*N);
float radius, azimuth, ci, cj;
double start, end;
unsigned short *input, *output;

ci = N*0.496; // changes according to calibration
cj = N*0.4985; // reality is more complicated (tilts etc)
for( size_t i=0; i<N; i++){
for( size_t j=0; j<N; j++){
radius = sqrt( (i-ci)*(i-ci) + (j-cj)*(j-cj) );
azimuth = atan2( i-ci, j-cj ); // from -pi to pi
sort_order[i*N+j] = round( radius ) + azimuth/2/M_PI;
indices[i*N+j] = i*N+j;
}
}
// Find the order to sort data onto a spiral
std::sort( indices.begin(), indices.end(),
[&sort_order](int i, int j){
return sort_order[i] < sort_order[j]; });
// Invent some test data
input = new unsigned short [N*N];
output = new unsigned short [N*N];
for( size_t i=0 ; i<N*N; i++){
input[i] = i;
output[i]= 0;
}
// some timing:
bench(reorder_simple);
bench(reorder_omp) ;
bench(copy_simple) ;
bench(copy_omp) ;
}


% g++ reorder.cpp -o reorder -std=c++11 -O3 -march=native -fopenmp -Wall
% ./reorder
reorder_simple, Time taken: 0.0179023 /s
reorder_omp, Time taken: 0.00349932 /s
copy_simple, Time taken: 0.00140805 /s
copy_omp, Time taken: 0.000250205 /s


I would like to get the reorder_omp function to be closer to the speed of the copy_omp function. Detectors can run at 500 frames per second so 3.5 ms is bad in comparison to 0.25 ms.



Edit again: 21/11/2018 the code to write the function that does not compile



  //top of file
#include <fstream>
...
//just before the end:
std::ofstream out;
out.open("cfunc.c");
out << "void cfunc( unsigned short * restrict input,n" <<
" unsigned short * restrict output){ n";
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
out << "output[" << i*N+j << "] = input[" << indices[i*N+j] << "];n";
out << "}n";
out.close();


Testing this on a different machine I am getting compiler errors from both gcc (7.3.0) and clang (6.0.0). It compiles and runs with tcc (0.9.27) but finishes slower than the looping over the indices.










share|improve this question















I want to optimize the re-ordering of some data arrays that hold about 4 million unsigned shorts. The aim is to process a data stream by bringing values that should be similar to each other to be close to each other. The pseudo-code is like this:



  for( i=0; i<n; i++)
dest[i] = src[ idx[i] ] ;


To optimize the code for a specific list of idx[i] I tried to compile a 4 million line c function with the idx values filled in:



void reorder( unsigned short * restrict i, unsigned short * restrict o) {
o[0]=i[2075723];
o[1]=i[2075724];
o[2]=i[2075722];
...
o[4194301]=i[4192257];
o[4194302]=i[4192256];
o[4194303]=i[4190208];
}


I had hoped to get GCC to create a clever stream of pshufw/pblend/unpack instructions ... instead it hangs after using up a lot of memory (7 GB). I was trying to make copy based version to avoid the complications of doing swaps in place.



Would anyone be able to suggest good ways to produce optimized the code for this problem? So far I tried:




  • ordered reading, random writes : 60 ms (openmp didn't help)

  • ordered writing, random reads : 20 ms (openmp -> 4 ms)


I was hoping to end up getting closer to the memory bandwidth (order 0.4 ms). A scheme that takes into account cache size and does blocking should help but I don't know where to start for designing one to do it. I also wonder if there is a simple way to exploit SIMD instructions?



Making a toy example with transpose I couldn't even get gcc to output an SIMD version, see:



https://godbolt.org/z/bzGWad



Is this a difficult problem for compilers or am I missing something simple?



Edit 21/11/2018 Added a complete but minimal example of the problem



Here is a complete example of the problem I am trying to optimise. In reality the ordering is a more complicated function, but the point is just to order data pixels according to their distance from the image center, like unwinding a spiral.



#include <omp.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <algorithm>

#define N 2048

// Sorting on output, one core
void reorder_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
}
// Sorting on output write, many cores
void reorder_omp( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[ indices[i] ];
}
// Benchmark for memory throughput, one core
void copy_simple( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
for( int i=0; i<N*N; i++)
output[i] = input[i];
}
// Benchmark for memory throughput, many cores
void copy_omp ( const std::vector<size_t> &indices,
const unsigned short input,
unsigned short output){
#pragma omp parallel for
for( int i=0; i<N*N; i++)
output[i] = input[i];
}

// Macro to avoid retyping
#define bench(func)
func( indices, input, output);
start = omp_get_wtime();
for( size_t i=0; i<100; i++)
func( indices, input, output );
end = omp_get_wtime();
std:: cout << std::setw(15) << #func <<
", Time taken: " << (end-start)/100 << " /sn";

int main()
{
std::vector<float> sort_order(N*N);
std::vector<size_t> indices(N*N);
float radius, azimuth, ci, cj;
double start, end;
unsigned short *input, *output;

ci = N*0.496; // changes according to calibration
cj = N*0.4985; // reality is more complicated (tilts etc)
for( size_t i=0; i<N; i++){
for( size_t j=0; j<N; j++){
radius = sqrt( (i-ci)*(i-ci) + (j-cj)*(j-cj) );
azimuth = atan2( i-ci, j-cj ); // from -pi to pi
sort_order[i*N+j] = round( radius ) + azimuth/2/M_PI;
indices[i*N+j] = i*N+j;
}
}
// Find the order to sort data onto a spiral
std::sort( indices.begin(), indices.end(),
[&sort_order](int i, int j){
return sort_order[i] < sort_order[j]; });
// Invent some test data
input = new unsigned short [N*N];
output = new unsigned short [N*N];
for( size_t i=0 ; i<N*N; i++){
input[i] = i;
output[i]= 0;
}
// some timing:
bench(reorder_simple);
bench(reorder_omp) ;
bench(copy_simple) ;
bench(copy_omp) ;
}


% g++ reorder.cpp -o reorder -std=c++11 -O3 -march=native -fopenmp -Wall
% ./reorder
reorder_simple, Time taken: 0.0179023 /s
reorder_omp, Time taken: 0.00349932 /s
copy_simple, Time taken: 0.00140805 /s
copy_omp, Time taken: 0.000250205 /s


I would like to get the reorder_omp function to be closer to the speed of the copy_omp function. Detectors can run at 500 frames per second so 3.5 ms is bad in comparison to 0.25 ms.



Edit again: 21/11/2018 the code to write the function that does not compile



  //top of file
#include <fstream>
...
//just before the end:
std::ofstream out;
out.open("cfunc.c");
out << "void cfunc( unsigned short * restrict input,n" <<
" unsigned short * restrict output){ n";
for(int i=0;i<N;i++)
for(int j=0;j<N;j++)
out << "output[" << i*N+j << "] = input[" << indices[i*N+j] << "];n";
out << "}n";
out.close();


Testing this on a different machine I am getting compiler errors from both gcc (7.3.0) and clang (6.0.0). It compiles and runs with tcc (0.9.27) but finishes slower than the looping over the indices.







c++ arrays gcc optimization compiler-optimization






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 21 '18 at 23:10

























asked Nov 15 '18 at 14:25









Jon

62649




62649








  • 3




    So idx is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
    – Peter Cordes
    Nov 15 '18 at 14:36








  • 3




    What is the pattern? The two triplets shown have a output + (0, 1, 2) = input + (1, 2, 0) and output + (0, 1, 2) = input + (1, 0, 2) pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
    – Jonathan Leffler
    Nov 15 '18 at 14:47






  • 3




    I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you use short *__restrict dst to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
    – Peter Cordes
    Nov 15 '18 at 14:47








  • 2




    BTW: are your unsigned shorts wider than 16 bits?
    – joop
    Nov 15 '18 at 15:02










  • idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
    – Jon
    Nov 15 '18 at 16:55
















  • 3




    So idx is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
    – Peter Cordes
    Nov 15 '18 at 14:36








  • 3




    What is the pattern? The two triplets shown have a output + (0, 1, 2) = input + (1, 2, 0) and output + (0, 1, 2) = input + (1, 0, 2) pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
    – Jonathan Leffler
    Nov 15 '18 at 14:47






  • 3




    I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you use short *__restrict dst to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
    – Peter Cordes
    Nov 15 '18 at 14:47








  • 2




    BTW: are your unsigned shorts wider than 16 bits?
    – joop
    Nov 15 '18 at 15:02










  • idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
    – Jon
    Nov 15 '18 at 16:55










3




3




So idx is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
– Peter Cordes
Nov 15 '18 at 14:36






So idx is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
– Peter Cordes
Nov 15 '18 at 14:36






3




3




What is the pattern? The two triplets shown have a output + (0, 1, 2) = input + (1, 2, 0) and output + (0, 1, 2) = input + (1, 0, 2) pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
– Jonathan Leffler
Nov 15 '18 at 14:47




What is the pattern? The two triplets shown have a output + (0, 1, 2) = input + (1, 2, 0) and output + (0, 1, 2) = input + (1, 0, 2) pattern. Writing the same code out 40 times, let alone 4,000,000 times should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
– Jonathan Leffler
Nov 15 '18 at 14:47




3




3




I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you use short *__restrict dst to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
– Peter Cordes
Nov 15 '18 at 14:47






I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that. Even if you use short *__restrict dst to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
– Peter Cordes
Nov 15 '18 at 14:47






2




2




BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02




BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02












idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55






idx is a compile time constant. This is x86 for now but perhaps we should go to a gpu instead. Idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns, the target output is roughly a spiral from the original rectangle image from the detector Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55














1 Answer
1






active

oldest

votes


















0














(comment section is too short)



I would test the following idea:





  1. Maintain reverse index table, so that naive algorithm becomes:



     for (i = 0; i<n; i++) {
    dest[index[i]] = src[i];
    }



  2. Instead of using naive algorithm:



    2.1 Create temporary array of pairs (value, destindex)



    struct pair {
    int value;
    int destindex;
    };
    for (i = 0; i < n; i++) {
    pairs[i] = {.value=src[i], .destindex=index[i]};
    }


    2.2 Use merge or quick sort to sort array of pairs by .destindex field



    2.3 Copy values from array of pairs into dest array




There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.






share|improve this answer





















  • The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
    – Peter Cordes
    Nov 21 '18 at 17:26










  • I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
    – Jon
    Nov 21 '18 at 23:27











Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53321584%2fhow-to-optimise-the-re-ordering-of-an-array%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0














(comment section is too short)



I would test the following idea:





  1. Maintain reverse index table, so that naive algorithm becomes:



     for (i = 0; i<n; i++) {
    dest[index[i]] = src[i];
    }



  2. Instead of using naive algorithm:



    2.1 Create temporary array of pairs (value, destindex)



    struct pair {
    int value;
    int destindex;
    };
    for (i = 0; i < n; i++) {
    pairs[i] = {.value=src[i], .destindex=index[i]};
    }


    2.2 Use merge or quick sort to sort array of pairs by .destindex field



    2.3 Copy values from array of pairs into dest array




There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.






share|improve this answer





















  • The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
    – Peter Cordes
    Nov 21 '18 at 17:26










  • I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
    – Jon
    Nov 21 '18 at 23:27
















0














(comment section is too short)



I would test the following idea:





  1. Maintain reverse index table, so that naive algorithm becomes:



     for (i = 0; i<n; i++) {
    dest[index[i]] = src[i];
    }



  2. Instead of using naive algorithm:



    2.1 Create temporary array of pairs (value, destindex)



    struct pair {
    int value;
    int destindex;
    };
    for (i = 0; i < n; i++) {
    pairs[i] = {.value=src[i], .destindex=index[i]};
    }


    2.2 Use merge or quick sort to sort array of pairs by .destindex field



    2.3 Copy values from array of pairs into dest array




There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.






share|improve this answer





















  • The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
    – Peter Cordes
    Nov 21 '18 at 17:26










  • I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
    – Jon
    Nov 21 '18 at 23:27














0












0








0






(comment section is too short)



I would test the following idea:





  1. Maintain reverse index table, so that naive algorithm becomes:



     for (i = 0; i<n; i++) {
    dest[index[i]] = src[i];
    }



  2. Instead of using naive algorithm:



    2.1 Create temporary array of pairs (value, destindex)



    struct pair {
    int value;
    int destindex;
    };
    for (i = 0; i < n; i++) {
    pairs[i] = {.value=src[i], .destindex=index[i]};
    }


    2.2 Use merge or quick sort to sort array of pairs by .destindex field



    2.3 Copy values from array of pairs into dest array




There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.






share|improve this answer












(comment section is too short)



I would test the following idea:





  1. Maintain reverse index table, so that naive algorithm becomes:



     for (i = 0; i<n; i++) {
    dest[index[i]] = src[i];
    }



  2. Instead of using naive algorithm:



    2.1 Create temporary array of pairs (value, destindex)



    struct pair {
    int value;
    int destindex;
    };
    for (i = 0; i < n; i++) {
    pairs[i] = {.value=src[i], .destindex=index[i]};
    }


    2.2 Use merge or quick sort to sort array of pairs by .destindex field



    2.3 Copy values from array of pairs into dest array




There are no random accesses in this algorithm and thus no random access page faults. However, I am not sure that it will work better than naive algorithm because of large number of linear passes.







share|improve this answer












share|improve this answer



share|improve this answer










answered Nov 21 '18 at 17:19









gudok

2,71611022




2,71611022












  • The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
    – Peter Cordes
    Nov 21 '18 at 17:26










  • I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
    – Jon
    Nov 21 '18 at 23:27


















  • The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
    – Peter Cordes
    Nov 21 '18 at 17:26










  • I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
    – Jon
    Nov 21 '18 at 23:27
















The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26




The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26












I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27




I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read from improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27


















draft saved

draft discarded




















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.





Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


Please pay close attention to the following guidance:


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53321584%2fhow-to-optimise-the-re-ordering-of-an-array%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

404 Error Contact Form 7 ajax form submitting

How to know if a Active Directory user can login interactively

TypeError: fit_transform() missing 1 required positional argument: 'X'