__global__ declares kernel, which is called on host and executed on device __device__ declares device function, which is called and executed on device __host__ declares host function, which is called and executed on host __noinline__ to avoid inlining __forceinline__ to force inlining
__device__ declares device variable in global memory, accessible from all threads, with lifetime of application __constant__ declares device variable in constant memory, accessible from all threads, with lifetime of application __shared__ declares device varibale in block's shared memory, accessible from all threads within a block, with lifetime of block __restrict__ standard C definition that pointers are not aliased
dim3 gridDim; //dimensions of grid dim3 blockDim; //dimensions of block uint3 blockIdx; //block index within grid uint3 threadIdx; //thread index within block int warpSize; //number of threads in warp
dim3 blocks( nx, ny, nz ); // cuda 1.x has 1D and 2D grids, cuda 2.x adds 3D grids dim3 threadsPerBlock( mx, my, mz ); // cuda 1.x has 1D, 2D, and 3D blocks
__threadfence_block(); // wait until memory accesses are visible to block __threadfence(); // wait until memory accesses are visible to block and device __threadfence_system(); // wait until memory accesses are visible to block and device and host (2.x) __syncthreads(); // wait until all threads reach sync
// direction is one of cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost cudaMemcpy( dst_pointer, src_pointer, size, direction );
__constant__ float dev_data[n]; float host_data[n]; cudaMemcpyToSymbol ( dev_data, host_data, sizeof(host_data) ); // dev_data = host_data cudaMemcpyFromSymbol( host_data, dev_data, sizeof(host_data) ); // host_data = dev_data Also, malloc and free work inside a kernel(2.x), but memory allocated in a kernel must be deallocated in a kernel(not the host). It can be freed in a different kernel, though.
old = atomicAdd ( &addr, value ); // old = *addr; *addr += value old = atomicSub ( &addr, value ); // old = *addr; *addr –= value old = atomicExch( &addr, value ); // old = *addr; *addr = value
old = atomicMin ( &addr, value ); // old = *addr; *addr = min( old, value ) old = atomicMax ( &addr, value ); // old = *addr; *addr = max( old, value )
// increment up to value, then reset to 0 // decrement down to 0, then reset to value old = atomicInc ( &addr, value ); // old = *addr; *addr = ((old >= value) ? 0 : old+1 ) old = atomicDec ( &addr, value ); // old = *addr; *addr = ((old == 0) or (old > val) ? val : old–1 )
old = atomicAnd ( &addr, value ); // old = *addr; *addr &= value old = atomicOr ( &addr, value ); // old = *addr; *addr |= value old = atomicXor ( &addr, value ); // old = *addr; *addr ^= value
// compare-and-store old = atomicCAS ( &addr, compare, value ); // old = *addr; *addr = ((old == compare) ? value : old)
cuInit( 0 ); // takes flags for future use cuDeviceGetCount ( &cnt ); cuDeviceGet ( &dev, index ); cuDeviceGetName ( name, sizeof(name), dev ); cuDeviceComputeCapability( &major, &minor, dev ); cuDeviceTotalMem ( &bytes, dev ); cuDeviceGetProperties ( &properties, dev ); // max threads, etc.
BLAS functions have cublas prefix and first letter of usual BLAS function name is capitalized. Arguments are the same as standard BLAS, with these exceptions:
All functions add handle as first argument.
All functions return cublasStatus_t error code.
Constants alpha and beta are passed by pointer. All other scalars (n, incx, etc.) are bassed by value.
Functions that return a value, such as ddot, add result as last argument, and save value to result.
Constants are given in table above, instead of using characters.
Examples:
1 2
cublasDdot ( handle, n, x, incx, y, incy, &result ); // result = ddot( n, x, incx, y, incy ); cublasDaxpy( handle, n, &alpha, x, incx, y, incy ); // daxpy( n, alpha, x, incx, y, incy );
-v list compilation commands as they are executed
-dryrun list compilation commands, without executing
-keep saves intermediate files (e.g., pre-processed) for debugging
-clean removes output files (with same exact compiler options)
-arch=<compute_xy> generate PTX for capability x.y
-code=<sm_xy> generate binary for capability x.y, by default same as -arch
-gencode arch=…,code=… same as -arch and -code, but may be repeated
It makes most sense (to me) to give -arch a virtual architecture and -code a real architecture, though both flags accept both virtual and real architectures (at times).