CUDA byte atomic operation to cause only one thread to act











up vote
0
down vote

favorite












I am writing a CUDA program which has an array defined in the shared memory. What I need to do is to allow only one thread to write each index in this array, i. e. the first thread to reach this write instruction should change its value but any other threads either in the same warp or next warps should read the written value.



Here is the code snippet:



char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
printf("copy seq_shared seq_1_index = %d, block = %d n", seq_1_index, blockIdx.x);
}


What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.



How can I achieve that ?










share|improve this question




























    up vote
    0
    down vote

    favorite












    I am writing a CUDA program which has an array defined in the shared memory. What I need to do is to allow only one thread to write each index in this array, i. e. the first thread to reach this write instruction should change its value but any other threads either in the same warp or next warps should read the written value.



    Here is the code snippet:



    char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
    if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
    seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
    printf("copy seq_shared seq_1_index = %d, block = %d n", seq_1_index, blockIdx.x);
    }


    What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.



    How can I achieve that ?










    share|improve this question


























      up vote
      0
      down vote

      favorite









      up vote
      0
      down vote

      favorite











      I am writing a CUDA program which has an array defined in the shared memory. What I need to do is to allow only one thread to write each index in this array, i. e. the first thread to reach this write instruction should change its value but any other threads either in the same warp or next warps should read the written value.



      Here is the code snippet:



      char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
      if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
      seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
      printf("copy seq_shared seq_1_index = %d, block = %d n", seq_1_index, blockIdx.x);
      }


      What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.



      How can I achieve that ?










      share|improve this question















      I am writing a CUDA program which has an array defined in the shared memory. What I need to do is to allow only one thread to write each index in this array, i. e. the first thread to reach this write instruction should change its value but any other threads either in the same warp or next warps should read the written value.



      Here is the code snippet:



      char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
      if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
      seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
      printf("copy seq_shared seq_1_index = %d, block = %d n", seq_1_index, blockIdx.x);
      }


      What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.



      How can I achieve that ?







      cuda shared-memory atomic






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 21 at 18:18









      Robert Crovella

      93.1k4101143




      93.1k4101143










      asked Nov 18 at 23:19









      Amani Elsaed

      551519




      551519
























          1 Answer
          1






          active

          oldest

          votes

















          up vote
          3
          down vote



          accepted










          You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.



          This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.



          The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.



          Here's a worked example:



          $ cat t327.cu
          #include <stdio.h>

          __global__ void k(){

          __shared__ int flag;
          if (threadIdx.x == 0) flag = 0;
          __syncthreads();

          int retval = atomicCAS(&flag, 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          // could do if statement on retval here
          }


          int main(){

          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 1
          thread 3 saw flag as 1
          thread 4 saw flag as 1
          thread 5 saw flag as 1
          thread 6 saw flag as 1
          thread 7 saw flag as 1
          thread 8 saw flag as 1
          thread 9 saw flag as 1
          thread 10 saw flag as 1
          thread 11 saw flag as 1
          thread 12 saw flag as 1
          thread 13 saw flag as 1
          thread 14 saw flag as 1
          thread 15 saw flag as 1
          thread 16 saw flag as 1
          thread 17 saw flag as 1
          thread 18 saw flag as 1
          thread 19 saw flag as 1
          thread 20 saw flag as 1
          thread 21 saw flag as 1
          thread 22 saw flag as 1
          thread 23 saw flag as 1
          thread 24 saw flag as 1
          thread 25 saw flag as 1
          thread 26 saw flag as 1
          thread 27 saw flag as 1
          thread 28 saw flag as 1
          thread 29 saw flag as 1
          thread 30 saw flag as 1
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          Responding to a question in the comments, we could extend this to a char sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd() function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned) and we will convert the needed operation to effectively support a char size. This is done by converting the char address to a suitably-aligned unsigned address, and then doing shifting of the char quantity to line up in the appropriate byte position in the unsigned value.



          Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomicCAS(char *addr, char cmp, char val){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned mask = 0xFFU;
          mask <<= al_offset;
          mask = ~mask;
          unsigned sval = val;
          sval <<= al_offset;
          unsigned old = *al_addr, assumed, setval;
          do {
          assumed = old;
          setval = assumed & mask;
          setval |= sval;
          old = atomicCAS(al_addr, assumed, setval);
          } while (assumed != old);
          return (char) ((assumed >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          The above presents a generalized atomicCAS for char size. This would allow you to swap any char value for any other char value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr as already mentioned in the comments. The use of the atomicOr would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomic_flag(char *addr){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned my_bit = 1U << al_offset;
          return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          These char atomic methods assume that you have allocated a char array whose size is a multiple of 4. It would not be valid to do this with a char array of size 3 (and only 3 threads), for example.






          share|improve this answer























          • Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
            – Amani Elsaed
            Nov 20 at 2:12










          • You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
            – tera
            Nov 20 at 5:37












          • @Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
            – Amani Elsaed
            Nov 25 at 0:18











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53366421%2fcuda-byte-atomic-operation-to-cause-only-one-thread-to-act%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          3
          down vote



          accepted










          You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.



          This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.



          The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.



          Here's a worked example:



          $ cat t327.cu
          #include <stdio.h>

          __global__ void k(){

          __shared__ int flag;
          if (threadIdx.x == 0) flag = 0;
          __syncthreads();

          int retval = atomicCAS(&flag, 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          // could do if statement on retval here
          }


          int main(){

          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 1
          thread 3 saw flag as 1
          thread 4 saw flag as 1
          thread 5 saw flag as 1
          thread 6 saw flag as 1
          thread 7 saw flag as 1
          thread 8 saw flag as 1
          thread 9 saw flag as 1
          thread 10 saw flag as 1
          thread 11 saw flag as 1
          thread 12 saw flag as 1
          thread 13 saw flag as 1
          thread 14 saw flag as 1
          thread 15 saw flag as 1
          thread 16 saw flag as 1
          thread 17 saw flag as 1
          thread 18 saw flag as 1
          thread 19 saw flag as 1
          thread 20 saw flag as 1
          thread 21 saw flag as 1
          thread 22 saw flag as 1
          thread 23 saw flag as 1
          thread 24 saw flag as 1
          thread 25 saw flag as 1
          thread 26 saw flag as 1
          thread 27 saw flag as 1
          thread 28 saw flag as 1
          thread 29 saw flag as 1
          thread 30 saw flag as 1
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          Responding to a question in the comments, we could extend this to a char sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd() function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned) and we will convert the needed operation to effectively support a char size. This is done by converting the char address to a suitably-aligned unsigned address, and then doing shifting of the char quantity to line up in the appropriate byte position in the unsigned value.



          Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomicCAS(char *addr, char cmp, char val){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned mask = 0xFFU;
          mask <<= al_offset;
          mask = ~mask;
          unsigned sval = val;
          sval <<= al_offset;
          unsigned old = *al_addr, assumed, setval;
          do {
          assumed = old;
          setval = assumed & mask;
          setval |= sval;
          old = atomicCAS(al_addr, assumed, setval);
          } while (assumed != old);
          return (char) ((assumed >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          The above presents a generalized atomicCAS for char size. This would allow you to swap any char value for any other char value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr as already mentioned in the comments. The use of the atomicOr would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomic_flag(char *addr){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned my_bit = 1U << al_offset;
          return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          These char atomic methods assume that you have allocated a char array whose size is a multiple of 4. It would not be valid to do this with a char array of size 3 (and only 3 threads), for example.






          share|improve this answer























          • Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
            – Amani Elsaed
            Nov 20 at 2:12










          • You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
            – tera
            Nov 20 at 5:37












          • @Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
            – Amani Elsaed
            Nov 25 at 0:18















          up vote
          3
          down vote



          accepted










          You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.



          This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.



          The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.



          Here's a worked example:



          $ cat t327.cu
          #include <stdio.h>

          __global__ void k(){

          __shared__ int flag;
          if (threadIdx.x == 0) flag = 0;
          __syncthreads();

          int retval = atomicCAS(&flag, 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          // could do if statement on retval here
          }


          int main(){

          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 1
          thread 3 saw flag as 1
          thread 4 saw flag as 1
          thread 5 saw flag as 1
          thread 6 saw flag as 1
          thread 7 saw flag as 1
          thread 8 saw flag as 1
          thread 9 saw flag as 1
          thread 10 saw flag as 1
          thread 11 saw flag as 1
          thread 12 saw flag as 1
          thread 13 saw flag as 1
          thread 14 saw flag as 1
          thread 15 saw flag as 1
          thread 16 saw flag as 1
          thread 17 saw flag as 1
          thread 18 saw flag as 1
          thread 19 saw flag as 1
          thread 20 saw flag as 1
          thread 21 saw flag as 1
          thread 22 saw flag as 1
          thread 23 saw flag as 1
          thread 24 saw flag as 1
          thread 25 saw flag as 1
          thread 26 saw flag as 1
          thread 27 saw flag as 1
          thread 28 saw flag as 1
          thread 29 saw flag as 1
          thread 30 saw flag as 1
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          Responding to a question in the comments, we could extend this to a char sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd() function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned) and we will convert the needed operation to effectively support a char size. This is done by converting the char address to a suitably-aligned unsigned address, and then doing shifting of the char quantity to line up in the appropriate byte position in the unsigned value.



          Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomicCAS(char *addr, char cmp, char val){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned mask = 0xFFU;
          mask <<= al_offset;
          mask = ~mask;
          unsigned sval = val;
          sval <<= al_offset;
          unsigned old = *al_addr, assumed, setval;
          do {
          assumed = old;
          setval = assumed & mask;
          setval |= sval;
          old = atomicCAS(al_addr, assumed, setval);
          } while (assumed != old);
          return (char) ((assumed >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          The above presents a generalized atomicCAS for char size. This would allow you to swap any char value for any other char value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr as already mentioned in the comments. The use of the atomicOr would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomic_flag(char *addr){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned my_bit = 1U << al_offset;
          return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          These char atomic methods assume that you have allocated a char array whose size is a multiple of 4. It would not be valid to do this with a char array of size 3 (and only 3 threads), for example.






          share|improve this answer























          • Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
            – Amani Elsaed
            Nov 20 at 2:12










          • You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
            – tera
            Nov 20 at 5:37












          • @Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
            – Amani Elsaed
            Nov 25 at 0:18













          up vote
          3
          down vote



          accepted







          up vote
          3
          down vote



          accepted






          You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.



          This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.



          The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.



          Here's a worked example:



          $ cat t327.cu
          #include <stdio.h>

          __global__ void k(){

          __shared__ int flag;
          if (threadIdx.x == 0) flag = 0;
          __syncthreads();

          int retval = atomicCAS(&flag, 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          // could do if statement on retval here
          }


          int main(){

          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 1
          thread 3 saw flag as 1
          thread 4 saw flag as 1
          thread 5 saw flag as 1
          thread 6 saw flag as 1
          thread 7 saw flag as 1
          thread 8 saw flag as 1
          thread 9 saw flag as 1
          thread 10 saw flag as 1
          thread 11 saw flag as 1
          thread 12 saw flag as 1
          thread 13 saw flag as 1
          thread 14 saw flag as 1
          thread 15 saw flag as 1
          thread 16 saw flag as 1
          thread 17 saw flag as 1
          thread 18 saw flag as 1
          thread 19 saw flag as 1
          thread 20 saw flag as 1
          thread 21 saw flag as 1
          thread 22 saw flag as 1
          thread 23 saw flag as 1
          thread 24 saw flag as 1
          thread 25 saw flag as 1
          thread 26 saw flag as 1
          thread 27 saw flag as 1
          thread 28 saw flag as 1
          thread 29 saw flag as 1
          thread 30 saw flag as 1
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          Responding to a question in the comments, we could extend this to a char sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd() function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned) and we will convert the needed operation to effectively support a char size. This is done by converting the char address to a suitably-aligned unsigned address, and then doing shifting of the char quantity to line up in the appropriate byte position in the unsigned value.



          Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomicCAS(char *addr, char cmp, char val){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned mask = 0xFFU;
          mask <<= al_offset;
          mask = ~mask;
          unsigned sval = val;
          sval <<= al_offset;
          unsigned old = *al_addr, assumed, setval;
          do {
          assumed = old;
          setval = assumed & mask;
          setval |= sval;
          old = atomicCAS(al_addr, assumed, setval);
          } while (assumed != old);
          return (char) ((assumed >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          The above presents a generalized atomicCAS for char size. This would allow you to swap any char value for any other char value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr as already mentioned in the comments. The use of the atomicOr would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomic_flag(char *addr){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned my_bit = 1U << al_offset;
          return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          These char atomic methods assume that you have allocated a char array whose size is a multiple of 4. It would not be valid to do this with a char array of size 3 (and only 3 threads), for example.






          share|improve this answer














          You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.



          This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.



          The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.



          Here's a worked example:



          $ cat t327.cu
          #include <stdio.h>

          __global__ void k(){

          __shared__ int flag;
          if (threadIdx.x == 0) flag = 0;
          __syncthreads();

          int retval = atomicCAS(&flag, 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          // could do if statement on retval here
          }


          int main(){

          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 1
          thread 3 saw flag as 1
          thread 4 saw flag as 1
          thread 5 saw flag as 1
          thread 6 saw flag as 1
          thread 7 saw flag as 1
          thread 8 saw flag as 1
          thread 9 saw flag as 1
          thread 10 saw flag as 1
          thread 11 saw flag as 1
          thread 12 saw flag as 1
          thread 13 saw flag as 1
          thread 14 saw flag as 1
          thread 15 saw flag as 1
          thread 16 saw flag as 1
          thread 17 saw flag as 1
          thread 18 saw flag as 1
          thread 19 saw flag as 1
          thread 20 saw flag as 1
          thread 21 saw flag as 1
          thread 22 saw flag as 1
          thread 23 saw flag as 1
          thread 24 saw flag as 1
          thread 25 saw flag as 1
          thread 26 saw flag as 1
          thread 27 saw flag as 1
          thread 28 saw flag as 1
          thread 29 saw flag as 1
          thread 30 saw flag as 1
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          Responding to a question in the comments, we could extend this to a char sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd() function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned) and we will convert the needed operation to effectively support a char size. This is done by converting the char address to a suitably-aligned unsigned address, and then doing shifting of the char quantity to line up in the appropriate byte position in the unsigned value.



          Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomicCAS(char *addr, char cmp, char val){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned mask = 0xFFU;
          mask <<= al_offset;
          mask = ~mask;
          unsigned sval = val;
          sval <<= al_offset;
          unsigned old = *al_addr, assumed, setval;
          do {
          assumed = old;
          setval = assumed & mask;
          setval |= sval;
          old = atomicCAS(al_addr, assumed, setval);
          } while (assumed != old);
          return (char) ((assumed >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          The above presents a generalized atomicCAS for char size. This would allow you to swap any char value for any other char value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr as already mentioned in the comments. The use of the atomicOr would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:



          $ cat t327.cu
          #include <stdio.h>
          __device__ char my_char_atomic_flag(char *addr){
          unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
          unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
          unsigned my_bit = 1U << al_offset;
          return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
          }

          __global__ void k(){

          __shared__ char flag[1024];
          flag[threadIdx.x] = 0;
          __syncthreads();

          int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
          printf("thread %d saw flag as %dn", threadIdx.x, retval);
          }


          int main(){
          k<<<1,32>>>();
          cudaDeviceSynchronize();
          }
          $ nvcc -o t327 t327.cu
          $ cuda-memcheck ./t327
          ========= CUDA-MEMCHECK
          thread 0 saw flag as 0
          thread 1 saw flag as 1
          thread 2 saw flag as 0
          thread 3 saw flag as 1
          thread 4 saw flag as 0
          thread 5 saw flag as 1
          thread 6 saw flag as 0
          thread 7 saw flag as 1
          thread 8 saw flag as 0
          thread 9 saw flag as 1
          thread 10 saw flag as 0
          thread 11 saw flag as 1
          thread 12 saw flag as 0
          thread 13 saw flag as 1
          thread 14 saw flag as 0
          thread 15 saw flag as 1
          thread 16 saw flag as 0
          thread 17 saw flag as 1
          thread 18 saw flag as 0
          thread 19 saw flag as 1
          thread 20 saw flag as 0
          thread 21 saw flag as 1
          thread 22 saw flag as 0
          thread 23 saw flag as 1
          thread 24 saw flag as 0
          thread 25 saw flag as 1
          thread 26 saw flag as 0
          thread 27 saw flag as 1
          thread 28 saw flag as 0
          thread 29 saw flag as 1
          thread 30 saw flag as 0
          thread 31 saw flag as 1
          ========= ERROR SUMMARY: 0 errors
          $


          These char atomic methods assume that you have allocated a char array whose size is a multiple of 4. It would not be valid to do this with a char array of size 3 (and only 3 threads), for example.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 21 at 4:35

























          answered Nov 19 at 17:46









          Robert Crovella

          93.1k4101143




          93.1k4101143












          • Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
            – Amani Elsaed
            Nov 20 at 2:12










          • You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
            – tera
            Nov 20 at 5:37












          • @Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
            – Amani Elsaed
            Nov 25 at 0:18


















          • Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
            – Amani Elsaed
            Nov 20 at 2:12










          • You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
            – tera
            Nov 20 at 5:37












          • @Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
            – Amani Elsaed
            Nov 25 at 0:18
















          Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
          – Amani Elsaed
          Nov 20 at 2:12




          Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
          – Amani Elsaed
          Nov 20 at 2:12












          You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
          – tera
          Nov 20 at 5:37






          You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
          – tera
          Nov 20 at 5:37














          @Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
          – Amani Elsaed
          Nov 25 at 0:18




          @Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
          – Amani Elsaed
          Nov 25 at 0:18


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.





          Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


          Please pay close attention to the following guidance:


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53366421%2fcuda-byte-atomic-operation-to-cause-only-one-thread-to-act%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          404 Error Contact Form 7 ajax form submitting

          How to know if a Active Directory user can login interactively

          Refactoring coordinates for Minecraft Pi buildings written in Python