How to optimise the re-ordering of an array?
I want to optimize the re-ordering of some data arrays that hold about 4 million unsigned shorts. The aim is to process a data stream by moving values that should be similar to each other so that they end up close together. The pseudo-code is like this:
for( i=0; i<n; i++)
dest[i] = src[ idx[i] ] ;
To optimize the code for a specific list of idx[i], I tried compiling a 4-million-line C function with the idx values filled in:
void reorder( unsigned short * restrict i, unsigned short * restrict o) {
o[0]=i[2075723];
o[1]=i[2075724];
o[2]=i[2075722];
...
o[4194301]=i[4192257];
o[4194302]=i[4192256];
o[4194303]=i[4190208];
}
I had hoped to get GCC to create a clever stream of pshufw/pblend/unpack instructions ... instead it hangs after using up a lot of memory (7 GB). I was trying to make a copy-based version to avoid the complications of doing swaps in place.
Would anyone be able to suggest good ways to produce optimized code for this problem? So far I tried:
- ordered reading, random writes : 60 ms (openmp didn't help)
- ordered writing, random reads : 20 ms (openmp -> 4 ms)
I was hoping to end up closer to memory bandwidth (on the order of 0.4 ms). A scheme that takes cache size into account and does blocking should help, but I don't know where to start in designing one. I also wonder if there is a simple way to exploit SIMD instructions?
In a toy example with a transpose I couldn't even get GCC to output a SIMD version, see:
https://godbolt.org/z/bzGWad
Is this a difficult problem for compilers or am I missing something simple?
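For illustration, one way a cache-blocking scheme could look is a two-pass bucketed permutation (a sketch only; `reorder_bucketed` and its signature are invented for this example and it has not been benchmarked on this workload): pass 1 reads `src` sequentially and appends each value to a bucket covering a cache-sized slice of destination positions, and pass 2 scatters within each bucket while that slice of `dst` stays cache-resident.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch: two-pass bucketed permutation. idx has the same meaning as in the
// question: dest[i] = src[idx[i]].
void reorder_bucketed(const std::vector<size_t>& idx,
                      const uint16_t* src, uint16_t* dst, size_t n)
{
    const size_t BUCKET = size_t(1) << 15;              // 32k entries ~ 64 KiB of output
    const size_t nbuckets = (n + BUCKET - 1) / BUCKET;

    // Invert the permutation once per idx table: inv[j] = where src[j] lands.
    std::vector<size_t> inv(n);
    for (size_t i = 0; i < n; ++i) inv[idx[i]] = i;

    // Pass 1: sequential reads of src, near-sequential appends per bucket.
    std::vector<std::vector<std::pair<uint16_t, uint32_t>>> buckets(nbuckets);
    for (size_t j = 0; j < n; ++j)
        buckets[inv[j] / BUCKET].emplace_back(src[j], uint32_t(inv[j] % BUCKET));

    // Pass 2: each bucket writes into one cache-sized slice of dst.
    for (size_t b = 0; b < nbuckets; ++b)
        for (auto& p : buckets[b])
            dst[b * BUCKET + p.second] = p.first;
}
```

The inversion and the bucket partitioning depend only on idx, so for a fixed index table they could be precomputed once and reused for millions of frames.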
Edit 21/11/2018: Added a complete but minimal example of the problem.
Here is a complete example of the problem I am trying to optimise. In reality the ordering is a more complicated function, but the point is just to order data pixels according to their distance from the image center, like unwinding a spiral.
#include <omp.h>
#include <vector>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <algorithm>

#define N 2048

// Sorting on output, one core
void reorder_simple( const std::vector<size_t> &indices,
                     const unsigned short *input,
                     unsigned short *output){
    for( int i=0; i<N*N; i++)
        output[i] = input[ indices[i] ];
}

// Sorting on output write, many cores
void reorder_omp( const std::vector<size_t> &indices,
                  const unsigned short *input,
                  unsigned short *output){
#pragma omp parallel for
    for( int i=0; i<N*N; i++)
        output[i] = input[ indices[i] ];
}

// Benchmark for memory throughput, one core
void copy_simple( const std::vector<size_t> &indices,
                  const unsigned short *input,
                  unsigned short *output){
    for( int i=0; i<N*N; i++)
        output[i] = input[i];
}

// Benchmark for memory throughput, many cores
void copy_omp( const std::vector<size_t> &indices,
               const unsigned short *input,
               unsigned short *output){
#pragma omp parallel for
    for( int i=0; i<N*N; i++)
        output[i] = input[i];
}

// Macro to avoid retyping
#define bench(func) \
    func( indices, input, output); \
    start = omp_get_wtime(); \
    for( size_t i=0; i<100; i++) \
        func( indices, input, output ); \
    end = omp_get_wtime(); \
    std::cout << std::setw(15) << #func << \
        ", Time taken: " << (end-start)/100 << " /s\n";

int main()
{
    std::vector<float> sort_order(N*N);
    std::vector<size_t> indices(N*N);
    float radius, azimuth, ci, cj;
    double start, end;
    unsigned short *input, *output;
    ci = N*0.496;  // changes according to calibration
    cj = N*0.4985; // reality is more complicated (tilts etc)
    for( size_t i=0; i<N; i++){
        for( size_t j=0; j<N; j++){
            radius = sqrt( (i-ci)*(i-ci) + (j-cj)*(j-cj) );
            azimuth = atan2( i-ci, j-cj ); // from -pi to pi
            sort_order[i*N+j] = round( radius ) + azimuth/2/M_PI;
            indices[i*N+j] = i*N+j;
        }
    }
    // Find the order to sort data onto a spiral
    std::sort( indices.begin(), indices.end(),
               [&sort_order](int i, int j){
                   return sort_order[i] < sort_order[j]; });
    // Invent some test data
    input  = new unsigned short [N*N];
    output = new unsigned short [N*N];
    for( size_t i=0; i<N*N; i++){
        input[i]  = i;
        output[i] = 0;
    }
    // some timing:
    bench(reorder_simple);
    bench(reorder_omp);
    bench(copy_simple);
    bench(copy_omp);
}
% g++ reorder.cpp -o reorder -std=c++11 -O3 -march=native -fopenmp -Wall
% ./reorder
reorder_simple, Time taken: 0.0179023 /s
reorder_omp, Time taken: 0.00349932 /s
copy_simple, Time taken: 0.00140805 /s
copy_omp, Time taken: 0.000250205 /s
I would like to get the reorder_omp function to be closer to the speed of the copy_omp function. Detectors can run at 500 frames per second, so 3.5 ms is bad in comparison to 0.25 ms.
Edit again, 21/11/2018: the code that writes the function that does not compile.
// top of file
#include <fstream>
...
// just before the end:
std::ofstream out;
out.open("cfunc.c");
out << "void cfunc( unsigned short * restrict input,\n" <<
       "            unsigned short * restrict output){\n";
for(int i=0; i<N; i++)
    for(int j=0; j<N; j++)
        out << "output[" << i*N+j << "] = input[" << indices[i*N+j] << "];\n";
out << "}\n";
out.close();
Testing this on a different machine, I am getting compiler errors from both gcc (7.3.0) and clang (6.0.0). It compiles and runs with tcc (0.9.27) but finishes slower than looping over the indices.
c++ arrays gcc optimization compiler-optimization
So idx is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it, so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
– Peter Cordes
Nov 15 '18 at 14:36
What is the pattern? The two triplets shown have an output + (0, 1, 2) = input + (1, 2, 0) and an output + (0, 1, 2) = input + (1, 0, 2) pattern. Writing the same code out 40 times, let alone 4,000,000 times, should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
– Jonathan Leffler
Nov 15 '18 at 14:47
I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that, even if you use short *__restrict dst to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
– Peter Cordes
Nov 15 '18 at 14:47
BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02
idx is a compile-time constant. This is x86 for now, but perhaps we should go to a GPU instead. idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16 bit. There are patterns; the target output is roughly a spiral from the original rectangular image from the detector. Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55
edited Nov 21 '18 at 23:10
asked Nov 15 '18 at 14:25
Jon
1 Answer
(comment section is too short)
I would test the following ideas:
1. Maintain a reverse index table, so that the naive algorithm becomes:
for (i = 0; i < n; i++) {
    dest[index[i]] = src[i];
}
2. Instead of using the naive algorithm:
2.1 Create a temporary array of (value, destindex) pairs:
struct pair {
    int value;
    int destindex;
};
for (i = 0; i < n; i++) {
    pairs[i] = (struct pair){ .value = src[i], .destindex = index[i] };
}
2.2 Use merge sort or quicksort to sort the array of pairs by the .destindex field.
2.3 Copy the values from the array of pairs into the dest array.
There are no random accesses in this algorithm and thus no random-access page faults. However, I am not sure that it will work better than the naive algorithm because of the large number of linear passes.
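For concreteness, steps 2.1-2.3 could be sketched in C++ like this (`reorder_by_sort` and its signature are invented for this example, and it is not benchmarked here; as in step 1, index[i] is the destination position of src[i]):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of steps 2.1-2.3: pair each value with its destination index,
// sort by destination, then copy out sequentially.
void reorder_by_sort(const std::vector<size_t>& index,
                     const uint16_t* src, uint16_t* dest, size_t n)
{
    struct Pair { uint16_t value; size_t destindex; };
    std::vector<Pair> pairs(n);
    for (size_t i = 0; i < n; ++i)
        pairs[i] = { src[i], index[i] };           // 2.1: build (value, destindex)
    std::sort(pairs.begin(), pairs.end(),          // 2.2: sort by destination
              [](const Pair& a, const Pair& b){ return a.destindex < b.destindex; });
    for (size_t i = 0; i < n; ++i)
        dest[i] = pairs[i].value;                  // 2.3: sequential copy out
}
```

Since index is a fixed permutation, the sort's comparison order is the same every frame, so a radix sort on destindex (or a precomputed sorted order) might replace the comparison sort in practice.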
The OP already considered your first idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah, it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26
I tried something like this: creating small blocks of output (64 values) and ordering these according to what was recently read improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines, and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53321584%2fhow-to-optimise-the-re-ordering-of-an-array%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
(comment section is too short)
I would test the following idea:

1. Maintain a reverse index table, so that the naive algorithm becomes:

for (i = 0; i < n; i++) {
    dest[index[i]] = src[i];
}

2. Instead of using the naive algorithm:

2.1 Create a temporary array of (value, destindex) pairs:

struct pair {
    int value;
    int destindex;
};

for (i = 0; i < n; i++) {
    pairs[i] = (struct pair){ .value = src[i], .destindex = index[i] };
}

2.2 Use merge sort or quicksort to sort the array of pairs by the .destindex field.

2.3 Copy the values from the array of pairs into the dest array.
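Putting steps 2.1–2.3 together, here is a minimal sketch in C++ using std::sort for step 2.2. The function name, the use of uint16_t/uint32_t for the element and index types, and the assumption that index is a permutation are all illustrative choices, not part of the original answer:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Reorder src into dest, where index[i] is the destination position of
// src[i] (the reverse index table from step 1). All memory is touched in
// linear passes, except inside the sort itself.
void reorder_by_sort(const std::vector<uint16_t>& src,
                     const std::vector<uint32_t>& index,
                     std::vector<uint16_t>& dest) {
    struct Pair { uint16_t value; uint32_t destindex; };
    std::vector<Pair> pairs(src.size());

    // 2.1: linear read of src and index, linear write of pairs
    for (size_t i = 0; i < src.size(); ++i)
        pairs[i] = {src[i], index[i]};

    // 2.2: sort the pairs by destination index
    std::sort(pairs.begin(), pairs.end(),
              [](const Pair& a, const Pair& b) { return a.destindex < b.destindex; });

    // 2.3: linear pass writing dest in order
    // (assuming index is a permutation, pairs[i].destindex == i here)
    dest.resize(src.size());
    for (size_t i = 0; i < pairs.size(); ++i)
        dest[i] = pairs[i].value;
}
```

Note the trade-off the answer hedges about: the sort replaces one random-access pass with O(n log n) work spread over several cache-friendly passes, which may or may not beat the naive loop.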
There are no random accesses in this algorithm and thus no random-access page faults. However, I am not sure that it will work better than the naive algorithm, because of the large number of linear passes.
answered Nov 21 '18 at 17:19
gudok
The OP already considered your 1. idea, and found "ordered reading, random writes" 3x slower than "ordered writing, random reads". Sorting could be parallelized and done with better locality, but yeah it's probably not a win.
– Peter Cordes
Nov 21 '18 at 17:26
I tried something like this, creating small blocks of output (64 values) and ordering these according to what was recently read; it improved the run time by about 20%. It helped, but maybe the ordering was still bad for pipelines, and there was still no vectorisation.
– Jon
Nov 21 '18 at 23:27
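The block-of-64 experiment described in the comment above could be sketched as follows. The block size matches the comment; visiting blocks in order of their minimum source index is an assumption standing in for "ordering these according to what was recently read", and all names are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Reorder in blocks of 64 output values, visiting the blocks in order of
// their smallest source index, so that consecutive blocks tend to read
// from nearby cache lines.
void reorder_blocked(const std::vector<uint16_t>& src,
                     const std::vector<uint32_t>& idx,
                     std::vector<uint16_t>& dest) {
    const size_t n = src.size();
    const size_t B = 64;                       // output block size
    const size_t nblocks = (n + B - 1) / B;

    // For each block, find the minimum source index it touches.
    std::vector<uint32_t> minidx(nblocks);
    for (size_t b = 0; b < nblocks; ++b) {
        uint32_t m = idx[b * B];
        for (size_t i = b * B; i < std::min(n, (b + 1) * B); ++i)
            m = std::min(m, idx[i]);
        minidx[b] = m;
    }

    // Visit blocks in order of increasing minimum source index.
    std::vector<uint32_t> order(nblocks);
    std::iota(order.begin(), order.end(), 0u);
    std::sort(order.begin(), order.end(),
              [&](uint32_t a, uint32_t b) { return minidx[a] < minidx[b]; });

    dest.resize(n);
    for (uint32_t b : order)                   // blocks in source-friendly order
        for (size_t i = size_t(b) * B; i < std::min(n, (size_t(b) + 1) * B); ++i)
            dest[i] = src[idx[i]];             // random reads, sequential writes per block
}
```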
So idx is a compile-time constant? Is there a lot of locality in it? Like groups of destination elements that come from nearby source elements? But there's no pattern to it, so you can't make a loop other than by using gather instructions or scalar? Is this x86? What microarchitectures are you tuning for?
– Peter Cordes
Nov 15 '18 at 14:36
What is the pattern? The two triplets shown have an output + (0, 1, 2) = input + (1, 2, 0) and an output + (0, 1, 2) = input + (1, 0, 2) pattern. Writing the same code out 40 times, let alone 4,000,000 times, should give you the shivers (you should write a program to write the program, at least!). I've got a feeling there's an XY Problem here.
– Jonathan Leffler
Nov 15 '18 at 14:47
I'm not sure gcc knows how to build shuffles out of scalar loads/stores like that, even if you use short *__restrict dst to tell it that src and dst don't overlap to make it possible. clang might possibly do something, but I'm not surprised the compile-time memory usage was huge. Your ideas are good; some kind of cache-blocking should be helpful to group reads and writes into small sets of a few (less than 8) cache lines, preferably with some coarser locality with respect to 4k pages too.
– Peter Cordes
Nov 15 '18 at 14:47
BTW: are your unsigned shorts wider than 16 bits?
– joop
Nov 15 '18 at 15:02
idx is a compile-time constant. This is x86 for now, but perhaps we should go to a GPU instead. idx would change from day to day but remain constant for a block of data (some millions of images). The data are 16-bit. There are patterns; the target output is roughly a spiral from the original rectangular image from the detector. Perhaps I should put this info above?
– Jon
Nov 15 '18 at 16:55
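The "spiral from the image centre" ordering Jon describes could be generated with something like the following. The image shape, the squared-Euclidean distance metric, and the function name are assumptions for illustration; the real ordering function is stated to be more complicated:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Build an idx[] that reads pixels of a w x h image in order of their
// distance from the image centre ("unwinding a spiral").
// idx[i] = source pixel for output position i.
std::vector<uint32_t> spiral_index(int w, int h) {
    std::vector<uint32_t> idx(size_t(w) * h);
    std::iota(idx.begin(), idx.end(), 0u);     // 0, 1, ..., w*h-1
    const double cx = (w - 1) / 2.0, cy = (h - 1) / 2.0;
    std::sort(idx.begin(), idx.end(), [&](uint32_t a, uint32_t b) {
        double dxa = a % w - cx, dya = a / w - cy;
        double dxb = b % w - cx, dyb = b / w - cy;
        // compare squared distances from the centre
        return dxa * dxa + dya * dya < dxb * dxb + dyb * dyb;
    });
    return idx;
}
```

An idx built this way has the locality Peter Cordes asks about: nearby output positions come from pixels on nearby rings of the image, which is what makes cache-blocking schemes plausible.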