CUDA byte atomic operation to cause only one thread to act

up vote
0
down vote

favorite

I am writing a CUDA program which has an array defined in the shared memory. What I need to do is to allow only one thread to write each index in this array, i. e. the first thread to reach this write instruction should change its value but any other threads either in the same warp or next warps should read the written value.

Here is the code snippet:

char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array

if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread

    seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread

    printf("copy seq_shared seq_1_index = %d,  block = %d n", seq_1_index, blockIdx.x);

}

What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.

How can I achieve that ?

edited Nov 21 at 18:18

Robert Crovella

93.1k4101143

asked Nov 18 at 23:19

Amani Elsaed

551519

add a comment |

up vote
0
down vote

favorite

Here is the code snippet:

char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array

if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread

    seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread

    printf("copy seq_shared seq_1_index = %d,  block = %d n", seq_1_index, blockIdx.x);

}

What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.

How can I achieve that ?

edited Nov 21 at 18:18

Robert Crovella

93.1k4101143

asked Nov 18 at 23:19

Amani Elsaed

551519

add a comment |

up vote
0
down vote

favorite

Here is the code snippet:

char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array

if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread

    seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread

    printf("copy seq_shared seq_1_index = %d,  block = %d n", seq_1_index, blockIdx.x);

}

What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.

How can I achieve that ?

edited Nov 21 at 18:18

Robert Crovella

93.1k4101143

asked Nov 18 at 23:19

Amani Elsaed

551519

Here is the code snippet:

char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array

if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread

    seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread

    printf("copy seq_shared seq_1_index = %d,  block = %d n", seq_1_index, blockIdx.x);

}

What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.

How can I achieve that ?

cuda shared-memory atomic

edited Nov 21 at 18:18

Robert Crovella

93.1k4101143

asked Nov 18 at 23:19

Amani Elsaed

551519

edited Nov 21 at 18:18

Robert Crovella

93.1k4101143

asked Nov 18 at 23:19

Amani Elsaed

551519

edited Nov 21 at 18:18

Robert Crovella

93.1k4101143

edited Nov 21 at 18:18

Robert Crovella

93.1k4101143

edited Nov 21 at 18:18

Robert Crovella

93.1k4101143

asked Nov 18 at 23:19

Amani Elsaed

551519

asked Nov 18 at 23:19

Amani Elsaed

551519

asked Nov 18 at 23:19

Amani Elsaed

551519

add a comment |

1 Answer
1

active

oldest

votes

up vote
3
down vote

accepted

You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.

This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.

The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.

Here's a worked example:

$ cat t327.cu

#include <stdio.h>



__global__ void k(){



  __shared__ int flag;

  if (threadIdx.x == 0) flag = 0;

  __syncthreads();



  int retval = atomicCAS(&flag, 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

  // could do if statement on retval here

}





int main(){



  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 1

thread 3 saw flag as 1

thread 4 saw flag as 1

thread 5 saw flag as 1

thread 6 saw flag as 1

thread 7 saw flag as 1

thread 8 saw flag as 1

thread 9 saw flag as 1

thread 10 saw flag as 1

thread 11 saw flag as 1

thread 12 saw flag as 1

thread 13 saw flag as 1

thread 14 saw flag as 1

thread 15 saw flag as 1

thread 16 saw flag as 1

thread 17 saw flag as 1

thread 18 saw flag as 1

thread 19 saw flag as 1

thread 20 saw flag as 1

thread 21 saw flag as 1

thread 22 saw flag as 1

thread 23 saw flag as 1

thread 24 saw flag as 1

thread 25 saw flag as 1

thread 26 saw flag as 1

thread 27 saw flag as 1

thread 28 saw flag as 1

thread 29 saw flag as 1

thread 30 saw flag as 1

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

Responding to a question in the comments, we could extend this to a char sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd() function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned) and we will convert the needed operation to effectively support a char size. This is done by converting the char address to a suitably-aligned unsigned address, and then doing shifting of the char quantity to line up in the appropriate byte position in the unsigned value.

Here is a worked example:

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomicCAS(char *addr, char cmp, char val){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned mask = 0xFFU;

  mask <<= al_offset;

  mask = ~mask;

  unsigned sval = val;

  sval <<= al_offset;

  unsigned old = *al_addr, assumed, setval;

  do {

        assumed = old;

        setval = assumed & mask;

        setval |= sval;

        old = atomicCAS(al_addr, assumed, setval);

    } while (assumed != old);

  return (char) ((assumed >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

The above presents a generalized atomicCAS for char size. This would allow you to swap any char value for any other char value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr as already mentioned in the comments. The use of the atomicOr would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomic_flag(char *addr){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned my_bit = 1U << al_offset;

  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

These char atomic methods assume that you have allocated a char array whose size is a multiple of 4. It would not be valid to do this with a char array of size 3 (and only 3 threads), for example.

edited Nov 21 at 4:35

answered Nov 19 at 17:46

Robert Crovella

93.1k4101143

Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12

You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37

@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53366421%2fcuda-byte-atomic-operation-to-cause-only-one-thread-to-act%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

up vote
3
down vote

accepted

You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.

Here's a worked example:

$ cat t327.cu

#include <stdio.h>



__global__ void k(){



  __shared__ int flag;

  if (threadIdx.x == 0) flag = 0;

  __syncthreads();



  int retval = atomicCAS(&flag, 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

  // could do if statement on retval here

}





int main(){



  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 1

thread 3 saw flag as 1

thread 4 saw flag as 1

thread 5 saw flag as 1

thread 6 saw flag as 1

thread 7 saw flag as 1

thread 8 saw flag as 1

thread 9 saw flag as 1

thread 10 saw flag as 1

thread 11 saw flag as 1

thread 12 saw flag as 1

thread 13 saw flag as 1

thread 14 saw flag as 1

thread 15 saw flag as 1

thread 16 saw flag as 1

thread 17 saw flag as 1

thread 18 saw flag as 1

thread 19 saw flag as 1

thread 20 saw flag as 1

thread 21 saw flag as 1

thread 22 saw flag as 1

thread 23 saw flag as 1

thread 24 saw flag as 1

thread 25 saw flag as 1

thread 26 saw flag as 1

thread 27 saw flag as 1

thread 28 saw flag as 1

thread 29 saw flag as 1

thread 30 saw flag as 1

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

Here is a worked example:

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomicCAS(char *addr, char cmp, char val){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned mask = 0xFFU;

  mask <<= al_offset;

  mask = ~mask;

  unsigned sval = val;

  sval <<= al_offset;

  unsigned old = *al_addr, assumed, setval;

  do {

        assumed = old;

        setval = assumed & mask;

        setval |= sval;

        old = atomicCAS(al_addr, assumed, setval);

    } while (assumed != old);

  return (char) ((assumed >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomic_flag(char *addr){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned my_bit = 1U << al_offset;

  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

edited Nov 21 at 4:35

answered Nov 19 at 17:46

Robert Crovella

93.1k4101143

Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12

You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37

@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18

add a comment |

up vote
3
down vote

accepted

You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.

Here's a worked example:

$ cat t327.cu

#include <stdio.h>



__global__ void k(){



  __shared__ int flag;

  if (threadIdx.x == 0) flag = 0;

  __syncthreads();



  int retval = atomicCAS(&flag, 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

  // could do if statement on retval here

}





int main(){



  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 1

thread 3 saw flag as 1

thread 4 saw flag as 1

thread 5 saw flag as 1

thread 6 saw flag as 1

thread 7 saw flag as 1

thread 8 saw flag as 1

thread 9 saw flag as 1

thread 10 saw flag as 1

thread 11 saw flag as 1

thread 12 saw flag as 1

thread 13 saw flag as 1

thread 14 saw flag as 1

thread 15 saw flag as 1

thread 16 saw flag as 1

thread 17 saw flag as 1

thread 18 saw flag as 1

thread 19 saw flag as 1

thread 20 saw flag as 1

thread 21 saw flag as 1

thread 22 saw flag as 1

thread 23 saw flag as 1

thread 24 saw flag as 1

thread 25 saw flag as 1

thread 26 saw flag as 1

thread 27 saw flag as 1

thread 28 saw flag as 1

thread 29 saw flag as 1

thread 30 saw flag as 1

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

Here is a worked example:

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomicCAS(char *addr, char cmp, char val){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned mask = 0xFFU;

  mask <<= al_offset;

  mask = ~mask;

  unsigned sval = val;

  sval <<= al_offset;

  unsigned old = *al_addr, assumed, setval;

  do {

        assumed = old;

        setval = assumed & mask;

        setval |= sval;

        old = atomicCAS(al_addr, assumed, setval);

    } while (assumed != old);

  return (char) ((assumed >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomic_flag(char *addr){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned my_bit = 1U << al_offset;

  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

edited Nov 21 at 4:35

answered Nov 19 at 17:46

Robert Crovella

93.1k4101143

Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12

You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37

@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18

add a comment |

up vote
3
down vote

accepted

You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.

Here's a worked example:

$ cat t327.cu

#include <stdio.h>



__global__ void k(){



  __shared__ int flag;

  if (threadIdx.x == 0) flag = 0;

  __syncthreads();



  int retval = atomicCAS(&flag, 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

  // could do if statement on retval here

}





int main(){



  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 1

thread 3 saw flag as 1

thread 4 saw flag as 1

thread 5 saw flag as 1

thread 6 saw flag as 1

thread 7 saw flag as 1

thread 8 saw flag as 1

thread 9 saw flag as 1

thread 10 saw flag as 1

thread 11 saw flag as 1

thread 12 saw flag as 1

thread 13 saw flag as 1

thread 14 saw flag as 1

thread 15 saw flag as 1

thread 16 saw flag as 1

thread 17 saw flag as 1

thread 18 saw flag as 1

thread 19 saw flag as 1

thread 20 saw flag as 1

thread 21 saw flag as 1

thread 22 saw flag as 1

thread 23 saw flag as 1

thread 24 saw flag as 1

thread 25 saw flag as 1

thread 26 saw flag as 1

thread 27 saw flag as 1

thread 28 saw flag as 1

thread 29 saw flag as 1

thread 30 saw flag as 1

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

Here is a worked example:

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomicCAS(char *addr, char cmp, char val){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned mask = 0xFFU;

  mask <<= al_offset;

  mask = ~mask;

  unsigned sval = val;

  sval <<= al_offset;

  unsigned old = *al_addr, assumed, setval;

  do {

        assumed = old;

        setval = assumed & mask;

        setval |= sval;

        old = atomicCAS(al_addr, assumed, setval);

    } while (assumed != old);

  return (char) ((assumed >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomic_flag(char *addr){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned my_bit = 1U << al_offset;

  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

edited Nov 21 at 4:35

answered Nov 19 at 17:46

Robert Crovella

93.1k4101143

You can use atomicCAS() for this. It does an atomic Compare-And-Swap operation.

Here's a worked example:

$ cat t327.cu

#include <stdio.h>



__global__ void k(){



  __shared__ int flag;

  if (threadIdx.x == 0) flag = 0;

  __syncthreads();



  int retval = atomicCAS(&flag, 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

  // could do if statement on retval here

}





int main(){



  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 1

thread 3 saw flag as 1

thread 4 saw flag as 1

thread 5 saw flag as 1

thread 6 saw flag as 1

thread 7 saw flag as 1

thread 8 saw flag as 1

thread 9 saw flag as 1

thread 10 saw flag as 1

thread 11 saw flag as 1

thread 12 saw flag as 1

thread 13 saw flag as 1

thread 14 saw flag as 1

thread 15 saw flag as 1

thread 16 saw flag as 1

thread 17 saw flag as 1

thread 18 saw flag as 1

thread 19 saw flag as 1

thread 20 saw flag as 1

thread 21 saw flag as 1

thread 22 saw flag as 1

thread 23 saw flag as 1

thread 24 saw flag as 1

thread 25 saw flag as 1

thread 26 saw flag as 1

thread 27 saw flag as 1

thread 28 saw flag as 1

thread 29 saw flag as 1

thread 30 saw flag as 1

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

Here is a worked example:

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomicCAS(char *addr, char cmp, char val){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned mask = 0xFFU;

  mask <<= al_offset;

  mask = ~mask;

  unsigned sval = val;

  sval <<= al_offset;

  unsigned old = *al_addr, assumed, setval;

  do {

        assumed = old;

        setval = assumed & mask;

        setval |= sval;

        old = atomicCAS(al_addr, assumed, setval);

    } while (assumed != old);

  return (char) ((assumed >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

$ cat t327.cu

#include <stdio.h>

__device__ char my_char_atomic_flag(char *addr){

  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));

  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;

  unsigned my_bit = 1U << al_offset;

  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);

}



__global__ void k(){



  __shared__ char flag[1024];

  flag[threadIdx.x] = 0;

  __syncthreads();



  int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));

  printf("thread %d saw flag as %dn", threadIdx.x, retval);

}





int main(){

  k<<<1,32>>>();

  cudaDeviceSynchronize();

}

$ nvcc -o t327 t327.cu

$ cuda-memcheck ./t327

========= CUDA-MEMCHECK

thread 0 saw flag as 0

thread 1 saw flag as 1

thread 2 saw flag as 0

thread 3 saw flag as 1

thread 4 saw flag as 0

thread 5 saw flag as 1

thread 6 saw flag as 0

thread 7 saw flag as 1

thread 8 saw flag as 0

thread 9 saw flag as 1

thread 10 saw flag as 0

thread 11 saw flag as 1

thread 12 saw flag as 0

thread 13 saw flag as 1

thread 14 saw flag as 0

thread 15 saw flag as 1

thread 16 saw flag as 0

thread 17 saw flag as 1

thread 18 saw flag as 0

thread 19 saw flag as 1

thread 20 saw flag as 0

thread 21 saw flag as 1

thread 22 saw flag as 0

thread 23 saw flag as 1

thread 24 saw flag as 0

thread 25 saw flag as 1

thread 26 saw flag as 0

thread 27 saw flag as 1

thread 28 saw flag as 0

thread 29 saw flag as 1

thread 30 saw flag as 0

thread 31 saw flag as 1

========= ERROR SUMMARY: 0 errors

$

edited Nov 21 at 4:35

answered Nov 19 at 17:46

Robert Crovella

93.1k4101143

edited Nov 21 at 4:35

answered Nov 19 at 17:46

Robert Crovella

93.1k4101143

answered Nov 19 at 17:46

Robert Crovella

93.1k4101143

answered Nov 19 at 17:46

Robert Crovella

93.1k4101143

Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12

You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37

@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18

add a comment |

Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12

You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37

@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18

Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12

You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37

@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

oMctDoBw9y2sHmaGGSU28ZBgRz,wqU6B O1Mz3rpTH6O,m9nATpWsCywrkA5l8uSi9 p8fr

搜尋此網誌

Tukukkk