CUDA byte atomic operation to cause only one thread to act
up vote
0
down vote
favorite
I am writing a CUDA program which has an array defined in the shared memory. What I need to do is to allow only one thread to write each index in this array, i. e. the first thread to reach this write instruction should change its value but any other threads either in the same warp or next warps should read the written value.
Here is the code snippet:
char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
printf("copy seq_shared seq_1_index = %d, block = %d n", seq_1_index, blockIdx.x);
}
What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.
How can I achieve that ?
cuda shared-memory atomic
add a comment |
up vote
0
down vote
favorite
I am writing a CUDA program which has an array defined in the shared memory. What I need to do is to allow only one thread to write each index in this array, i. e. the first thread to reach this write instruction should change its value but any other threads either in the same warp or next warps should read the written value.
Here is the code snippet:
char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
printf("copy seq_shared seq_1_index = %d, block = %d n", seq_1_index, blockIdx.x);
}
What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.
How can I achieve that ?
cuda shared-memory atomic
add a comment |
up vote
0
down vote
favorite
up vote
0
down vote
favorite
I am writing a CUDA program which has an array defined in the shared memory. What I need to do is to allow only one thread to write each index in this array, i. e. the first thread to reach this write instruction should change its value but any other threads either in the same warp or next warps should read the written value.
Here is the code snippet:
char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
printf("copy seq_shared seq_1_index = %d, block = %d n", seq_1_index, blockIdx.x);
}
What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.
How can I achieve that ?
cuda shared-memory atomic
I am writing a CUDA program which has an array defined in the shared memory. What I need to do is to allow only one thread to write each index in this array, i. e. the first thread to reach this write instruction should change its value but any other threads either in the same warp or next warps should read the written value.
Here is the code snippet:
char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
printf("copy seq_shared seq_1_index = %d, block = %d n", seq_1_index, blockIdx.x);
}
What is happening now is that all threads in the warp executes these exact sequence of instructions, thus the remaining code in the if condition is executed 32 times. I need to execute it only once.
How can I achieve that ?
cuda shared-memory atomic
cuda shared-memory atomic
edited Nov 21 at 18:18
Robert Crovella
93.1k4101143
93.1k4101143
asked Nov 18 at 23:19
Amani Elsaed
551519
551519
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
up vote
3
down vote
accepted
You can use atomicCAS()
for this. It does an atomic Compare-And-Swap operation.
This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.
The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.
Here's a worked example:
$ cat t327.cu
#include <stdio.h>
__global__ void k(){
__shared__ int flag;
if (threadIdx.x == 0) flag = 0;
__syncthreads();
int retval = atomicCAS(&flag, 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
// could do if statement on retval here
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
Responding to a question in the comments, we could extend this to a char
sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd()
function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned
) and we will convert the needed operation to effectively support a char
size. This is done by converting the char
address to a suitably-aligned unsigned
address, and then doing shifting of the char
quantity to line up in the appropriate byte position in the unsigned
value.
Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned mask = 0xFFU;
mask <<= al_offset;
mask = ~mask;
unsigned sval = val;
sval <<= al_offset;
unsigned old = *al_addr, assumed, setval;
do {
assumed = old;
setval = assumed & mask;
setval |= sval;
old = atomicCAS(al_addr, assumed, setval);
} while (assumed != old);
return (char) ((assumed >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
The above presents a generalized atomicCAS
for char
size. This would allow you to swap any char
value for any other char
value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr
as already mentioned in the comments. The use of the atomicOr
would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned my_bit = 1U << al_offset;
return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
These char
atomic methods assume that you have allocated a char
array whose size is a multiple of 4. It would not be valid to do this with a char
array of size 3 (and only 3 threads), for example.
Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12
You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37
@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
3
down vote
accepted
You can use atomicCAS()
for this. It does an atomic Compare-And-Swap operation.
This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.
The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.
Here's a worked example:
$ cat t327.cu
#include <stdio.h>
__global__ void k(){
__shared__ int flag;
if (threadIdx.x == 0) flag = 0;
__syncthreads();
int retval = atomicCAS(&flag, 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
// could do if statement on retval here
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
Responding to a question in the comments, we could extend this to a char
sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd()
function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned
) and we will convert the needed operation to effectively support a char
size. This is done by converting the char
address to a suitably-aligned unsigned
address, and then doing shifting of the char
quantity to line up in the appropriate byte position in the unsigned
value.
Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned mask = 0xFFU;
mask <<= al_offset;
mask = ~mask;
unsigned sval = val;
sval <<= al_offset;
unsigned old = *al_addr, assumed, setval;
do {
assumed = old;
setval = assumed & mask;
setval |= sval;
old = atomicCAS(al_addr, assumed, setval);
} while (assumed != old);
return (char) ((assumed >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
The above presents a generalized atomicCAS
for char
size. This would allow you to swap any char
value for any other char
value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr
as already mentioned in the comments. The use of the atomicOr
would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned my_bit = 1U << al_offset;
return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
These char
atomic methods assume that you have allocated a char
array whose size is a multiple of 4. It would not be valid to do this with a char
array of size 3 (and only 3 threads), for example.
Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12
You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37
@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18
add a comment |
up vote
3
down vote
accepted
You can use atomicCAS()
for this. It does an atomic Compare-And-Swap operation.
This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.
The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.
Here's a worked example:
$ cat t327.cu
#include <stdio.h>
__global__ void k(){
__shared__ int flag;
if (threadIdx.x == 0) flag = 0;
__syncthreads();
int retval = atomicCAS(&flag, 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
// could do if statement on retval here
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
Responding to a question in the comments, we could extend this to a char
sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd()
function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned
) and we will convert the needed operation to effectively support a char
size. This is done by converting the char
address to a suitably-aligned unsigned
address, and then doing shifting of the char
quantity to line up in the appropriate byte position in the unsigned
value.
Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned mask = 0xFFU;
mask <<= al_offset;
mask = ~mask;
unsigned sval = val;
sval <<= al_offset;
unsigned old = *al_addr, assumed, setval;
do {
assumed = old;
setval = assumed & mask;
setval |= sval;
old = atomicCAS(al_addr, assumed, setval);
} while (assumed != old);
return (char) ((assumed >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
The above presents a generalized atomicCAS
for char
size. This would allow you to swap any char
value for any other char
value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr
as already mentioned in the comments. The use of the atomicOr
would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned my_bit = 1U << al_offset;
return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
These char
atomic methods assume that you have allocated a char
array whose size is a multiple of 4. It would not be valid to do this with a char
array of size 3 (and only 3 threads), for example.
Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12
You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37
@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18
add a comment |
up vote
3
down vote
accepted
up vote
3
down vote
accepted
You can use atomicCAS()
for this. It does an atomic Compare-And-Swap operation.
This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.
The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.
Here's a worked example:
$ cat t327.cu
#include <stdio.h>
__global__ void k(){
__shared__ int flag;
if (threadIdx.x == 0) flag = 0;
__syncthreads();
int retval = atomicCAS(&flag, 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
// could do if statement on retval here
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
Responding to a question in the comments, we could extend this to a char
sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd()
function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned
) and we will convert the needed operation to effectively support a char
size. This is done by converting the char
address to a suitably-aligned unsigned
address, and then doing shifting of the char
quantity to line up in the appropriate byte position in the unsigned
value.
Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned mask = 0xFFU;
mask <<= al_offset;
mask = ~mask;
unsigned sval = val;
sval <<= al_offset;
unsigned old = *al_addr, assumed, setval;
do {
assumed = old;
setval = assumed & mask;
setval |= sval;
old = atomicCAS(al_addr, assumed, setval);
} while (assumed != old);
return (char) ((assumed >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
The above presents a generalized atomicCAS
for char
size. This would allow you to swap any char
value for any other char
value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr
as already mentioned in the comments. The use of the atomicOr
would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned my_bit = 1U << al_offset;
return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
These char
atomic methods assume that you have allocated a char
array whose size is a multiple of 4. It would not be valid to do this with a char
array of size 3 (and only 3 threads), for example.
You can use atomicCAS()
for this. It does an atomic Compare-And-Swap operation.
This function will test a variable, and if it matches a certain condition (say, false) it will replace it with another value (say, true). It will do all these things atomically, i.e. without the possibility of interruption.
The return value of the atomic function gives us useful information in this case. If the return value is false for the above example, then we can be certain that it was replaced with true. We can also be certain that we were the "first" thread to run into this condition, and all other threads doing a similar operation will have a return value of true, not false.
Here's a worked example:
$ cat t327.cu
#include <stdio.h>
__global__ void k(){
__shared__ int flag;
if (threadIdx.x == 0) flag = 0;
__syncthreads();
int retval = atomicCAS(&flag, 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
// could do if statement on retval here
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
Responding to a question in the comments, we could extend this to a char
sized flag by creating an arbitrary atomic operation modeled after the double atomicAdd()
function given in the programming guide. The basic idea is that we will perform an atomicCAS using a supported data size (e.g. unsigned
) and we will convert the needed operation to effectively support a char
size. This is done by converting the char
address to a suitably-aligned unsigned
address, and then doing shifting of the char
quantity to line up in the appropriate byte position in the unsigned
value.
Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned mask = 0xFFU;
mask <<= al_offset;
mask = ~mask;
unsigned sval = val;
sval <<= al_offset;
unsigned old = *al_addr, assumed, setval;
do {
assumed = old;
setval = assumed & mask;
setval |= sval;
old = atomicCAS(al_addr, assumed, setval);
} while (assumed != old);
return (char) ((assumed >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
The above presents a generalized atomicCAS
for char
size. This would allow you to swap any char
value for any other char
value. In your specific case, if you only need effectively a boolean flag, you can make this operation more efficient using atomicOr
as already mentioned in the comments. The use of the atomicOr
would allow you to eliminate the loop in the custom atomic function above. Here is a worked example:
$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
unsigned my_bit = 1U << al_offset;
return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}
__global__ void k(){
__shared__ char flag[1024];
flag[threadIdx.x] = 0;
__syncthreads();
int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
printf("thread %d saw flag as %dn", threadIdx.x, retval);
}
int main(){
k<<<1,32>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
These char
atomic methods assume that you have allocated a char
array whose size is a multiple of 4. It would not be valid to do this with a char
array of size 3 (and only 3 threads), for example.
edited Nov 21 at 4:35
answered Nov 19 at 17:46
Robert Crovella
93.1k4101143
93.1k4101143
Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12
You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37
@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18
add a comment |
Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12
You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37
@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18
Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12
Great. It works correctly. But how can I apply it to use 'char' ? It accepts only int values.
– Amani Elsaed
Nov 20 at 2:12
You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37
You can use atomicOr() to reduce memory consumption even down to one bit per flag - but you'll have to fiddle with with bitmasks and array offsets.
– tera
Nov 20 at 5:37
@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18
@Robert thanks for your detailed answer. It helped me to achieve my target and it works fine now. Thanks !!
– Amani Elsaed
Nov 25 at 0:18
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53366421%2fcuda-byte-atomic-operation-to-cause-only-one-thread-to-act%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown