Source code is in .cu files, which contain a mixture of host (CPU) and device (GPU) code.
Declaring functions
__global__      declares kernel, which is called on host and executed on device
__device__      declares device function, which is called and executed on device
__host__        declares host function, which is called and executed on host
__noinline__    to avoid inlining
__forceinline__ to force inlining
Declaring variables
__device__   declares device variable in global memory, accessible from all threads, with lifetime of application
__constant__ declares device variable in constant memory, accessible from all threads, with lifetime of application
__shared__   declares device variable in block's shared memory, accessible from all threads within a block, with lifetime of block
__restrict__ standard C definition that pointers are not aliased
Types
Most routines return an error code of type cudaError_t.
dim3
Components are accessible as variable.x, variable.y, variable.z, variable.w.
Constructor is make_<type>( x, ... ), for example:
float2 xx = make_float2( 1., 2. );
dim3 can take 1, 2, or 3 arguments:
dim3 blocks1D( 5 );
dim3 blocks2D( 5, 5 );
dim3 blocks3D( 5, 5, 5 );
Pre-defined variables
dim3  gridDim;   // dimensions of grid
dim3  blockDim;  // dimensions of block
uint3 blockIdx;  // block index within grid
uint3 threadIdx; // thread index within block
int   warpSize;  // number of threads in warp
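These combine to give each thread a unique global index; a minimal sketch (kernel name is hypothetical):

__global__ void indexExample( float* data, int n )
{
    // global index of this thread in a 1D grid of 1D blocks
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = i;  // guard: last block may have extra threads
}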
Kernel invocation
__global__ void kernel( ... ) { ... }

dim3 blocks( nx, ny, nz );          // CUDA 1.x has 1D and 2D grids; CUDA 2.x adds 3D grids
dim3 threadsPerBlock( mx, my, mz ); // CUDA 1.x has 1D, 2D, and 3D blocks
kernel<<< blocks, threadsPerBlock >>>( ... );
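Putting the pieces together, a minimal sketch of a complete launch (the kernel, sizes, and dev_x are hypothetical; dev_x is assumed to be a device pointer to n floats):

__global__ void scale( float* x, float alpha, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= alpha;
}

int n = 1000;
dim3 threadsPerBlock( 256 );
dim3 blocks( (n + threadsPerBlock.x - 1) / threadsPerBlock.x ); // round up to cover all n
scale<<< blocks, threadsPerBlock >>>( dev_x, 2.0f, n );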
Thread management
__threadfence_block();  // wait until memory accesses are visible to block
__threadfence();        // wait until memory accesses are visible to block and device
__threadfence_system(); // wait until memory accesses are visible to block, device, and host (2.x)
__syncthreads();        // wait until all threads reach sync
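A typical use of __syncthreads() is staging data through shared memory; a sketch assuming a fixed block size of 256:

__global__ void reverseWithinBlock( float* x )
{
    __shared__ float tile[256];  // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = x[i];
    __syncthreads();             // all writes to tile finish before any reads
    x[i] = tile[blockDim.x - 1 - threadIdx.x];
}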
Memory management

// direction is one of cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost
cudaMemcpy( dst_pointer, src_pointer, size, direction );
__constant__ float dev_data[n];
float host_data[n];
cudaMemcpyToSymbol  ( dev_data,  host_data, sizeof(host_data) ); // dev_data  = host_data
cudaMemcpyFromSymbol( host_data, dev_data,  sizeof(host_data) ); // host_data = dev_data

Also, malloc and free work inside a kernel (2.x), but memory allocated in a kernel must be deallocated in a kernel (not the host). It can be freed in a different kernel, though.
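The usual host-side allocate/copy/free cycle, as a minimal sketch (names hypothetical):

int n = 1024;
size_t size = n * sizeof(float);
float* host_x = (float*) malloc( size );
float* dev_x;
cudaMalloc( (void**) &dev_x, size );                       // allocate device memory
cudaMemcpy( dev_x, host_x, size, cudaMemcpyHostToDevice ); // host -> device
// ... launch kernels on dev_x ...
cudaMemcpy( host_x, dev_x, size, cudaMemcpyDeviceToHost ); // device -> host
cudaFree( dev_x );
free( host_x );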
Atomic functions
old = atomicAdd ( &addr, value ); // old = *addr; *addr += value
old = atomicSub ( &addr, value ); // old = *addr; *addr -= value
old = atomicExch( &addr, value ); // old = *addr; *addr  = value

old = atomicMin ( &addr, value ); // old = *addr; *addr = min( old, value )
old = atomicMax ( &addr, value ); // old = *addr; *addr = max( old, value )

// increment up to value, then reset to 0
// decrement down to 0, then reset to value
old = atomicInc ( &addr, value ); // old = *addr; *addr = ((old >= value) ? 0 : old+1)
old = atomicDec ( &addr, value ); // old = *addr; *addr = ((old == 0 || old > value) ? value : old-1)

old = atomicAnd ( &addr, value ); // old = *addr; *addr &= value
old = atomicOr  ( &addr, value ); // old = *addr; *addr |= value
old = atomicXor ( &addr, value ); // old = *addr; *addr ^= value

// compare-and-swap
old = atomicCAS ( &addr, compare, value ); // old = *addr; *addr = ((old == compare) ? value : old)
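For example, atomicAdd builds a histogram safely when many threads update the same bin; a sketch assuming 256 bins:

__global__ void histogram( const int* data, int* bins, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd( &bins[ data[i] % 256 ], 1 ); // concurrent updates to a bin are serialized
}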
Warp vote
int __all   ( predicate );
int __any   ( predicate );
int __ballot( predicate ); // nth thread sets nth bit to predicate
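A sketch counting how many threads per warp satisfy a predicate, via __ballot plus the bit-count intrinsic __popc (array sizes assumed to match the grid):

__global__ void countPositive( const float* x, int* warpCounts )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int bits = __ballot( x[i] > 0.0f ); // bit n set if lane n's predicate is true
    if (threadIdx.x % warpSize == 0)
        warpCounts[i / warpSize] = __popc( bits ); // count of set bits
}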
Timer
wall clock cycle counter
clock_t clock();
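A sketch timing a section of device code in cycles (the work line is a placeholder):

__global__ void timed( float* x, clock_t* cycles )
{
    clock_t start = clock();  // per-multiprocessor cycle counter
    x[threadIdx.x] *= 2.0f;   // section being timed
    clock_t stop = clock();
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = stop - start; // cycles, not seconds; divide by clock rate
}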
Texture
can also return float2 or float4, depending on texRef.
// integer index
float tex1Dfetch( texRef, ix );

// float index
float tex1D( texRef, x );
float tex2D( texRef, x, y );
float tex3D( texRef, x, y, z );

float tex1DLayered( texRef, x );
float tex2DLayered( texRef, x, y );
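A sketch using the texture reference API of this era, assuming a 1D float texture bound to a linear device buffer:

texture<float, 1, cudaReadModeElementType> texRef; // file-scope texture reference

__global__ void readTex( float* out, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch( texRef, i ); // integer-indexed fetch
}

// host side, before the launch:
// cudaBindTexture( NULL, texRef, dev_data, n * sizeof(float) );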
Device management (driver API)

cuInit( 0 );                              // takes flags for future use
cuDeviceGetCount         ( &cnt );
cuDeviceGet              ( &dev, index );
cuDeviceGetName          ( name, sizeof(name), dev );
cuDeviceComputeCapability( &major, &minor, dev );
cuDeviceTotalMem         ( &bytes, dev );
cuDeviceGetProperties    ( &properties, dev ); // max threads, etc.
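A sketch enumerating devices with the driver API (link with -lcuda):

#include <cuda.h>
#include <stdio.h>

int main()
{
    cuInit( 0 );
    int cnt;
    cuDeviceGetCount( &cnt );
    for (int index = 0; index < cnt; ++index) {
        CUdevice dev;
        char name[256];
        int major, minor;
        cuDeviceGet( &dev, index );
        cuDeviceGetName( name, sizeof(name), dev );
        cuDeviceComputeCapability( &major, &minor, dev );
        printf( "device %d: %s, compute capability %d.%d\n", index, name, major, minor );
    }
    return 0;
}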
cuBLAS
Matrices are column-major. Indices are 1-based; this affects the result of iamax and iamin.

BLAS functions have the cublas prefix, and the first letter of the usual BLAS function name is capitalized. Arguments are the same as in standard BLAS, with these exceptions:
All functions add handle as the first argument.
All functions return a cublasStatus_t error code.
Constants alpha and beta are passed by pointer. All other scalars (n, incx, etc.) are passed by value.
Functions that return a value, such as ddot, add a result pointer as the last argument and save the value to it.
Constants are given in the table above, instead of using characters.
Examples:
cublasDdot ( handle, n, x, incx, y, incy, &result ); // result = ddot( n, x, incx, y, incy );
cublasDaxpy( handle, n, &alpha, x, incx, y, incy );  // daxpy( n, alpha, x, incx, y, incy );
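A fuller sketch with handle creation (error checks omitted; dev_x and dev_y are assumed device arrays of n doubles):

#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate( &handle );

double result;
cublasDdot( handle, n, dev_x, 1, dev_y, 1, &result ); // result written to host variable

cublasDestroy( handle );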
Compiler
nvcc, often found in /usr/local/cuda/bin
Defines __CUDACC__
Flags common with cc
Short flag      Long flag                   Output or Description
-c              --compile                   .o object file
-E              --preprocess                on standard output
-M              --generate-dependencies     on standard output
-o file         --output-file file
-I directory    --include-path directory    header search path
-L directory    --library-path directory    library search path
-l lib          --library lib               link with library
-lib                                        generate library
-shared                                     generate shared library
-pg             --profile                   for gprof
-g level        --debug level
-G              --device-debug
-O level        --optimize level
Undocumented (but in sample makefiles)
-m64 compile x86_64 host CPU code
Flags specific to nvcc
-v       list compilation commands as they are executed
-dryrun  list compilation commands, without executing
-keep    save intermediate files (e.g., pre-processed) for debugging
-clean   remove output files (with the same exact compiler options)
-arch=<compute_xy>          generate PTX for capability x.y
-code=<sm_xy>               generate binary for capability x.y; by default same as -arch
-gencode arch=...,code=...  same as -arch and -code, but may be repeated
Arguments for -arch and -code
It makes most sense (to me) to give -arch a virtual architecture and -code a real architecture, though both flags accept both virtual and real architectures (at times).
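For instance (file name hypothetical):

# PTX for virtual architecture compute_20, binaries for real sm_20 and sm_21
nvcc -arch=compute_20 -code=sm_20,sm_21 saxpy.cu

# repeated -gencode covers several architectures in one fat binary
nvcc -gencode arch=compute_13,code=sm_13 \
     -gencode arch=compute_20,code=sm_20 saxpy.cu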