Memory types

NVIDIA GPU Memory

Global Memory

Allocated and deallocated by the host; read/write access by all threads. Large and slow, though it may be cached on some devices.

Shared Memory

Statically allocated; read/write access shared by all threads in a block. Much faster than local or global memory.

Local Memory

Statically allocated; read/write access by a single thread. Generally handled automatically by the compiler. Local memory is not a physical type of memory but an abstraction of global memory: its scope is local to the thread, yet it resides off-chip, which makes it as expensive to access as global memory. Local memory is used only to hold automatic variables. The compiler uses it when it determines that there is not enough register space to hold a variable; automatic variables that are large structures or arrays are also typically placed in local memory.

Constant memory

Statically allocated; read-only access by all threads. Constant memory holds data that does not change over the course of a kernel execution. Using constant rather than global memory can reduce the required memory bandwidth; however, this performance gain is only realized when a warp of threads reads the same location.

Register Memory

In most cases, accessing a register consumes zero extra clock cycles per instruction. However, delays can occur due to read-after-write dependencies and bank conflicts. The latency of a read-after-write dependency is roughly 24 clock cycles.

Bank conflicts

Shared memory is divided into equally sized banks (modules) that can be accessed simultaneously. However, if the addresses requested by multiple threads map to the same bank, the accesses are serialized: the hardware splits the conflicting memory request into as many separate conflict-free requests as necessary, reducing the effective bandwidth by a factor equal to the number of separate requests. When all threads in a warp address the same shared memory location, the value can be broadcast (devices of Compute Capability (hereafter CC) 2.0 and above additionally support multicast shared memory access, meaning that accesses by any number of threads within a warp to the same shared memory location are served simultaneously).

When no bank conflicts occur, the performance of shared memory is comparable to that of registers.

Bank

Shared memory is organized so that successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per clock cycle. For devices of CC 1.x, the warp size is 32 threads and the number of banks is 16; a shared memory request for a warp is split into one request for the first half of the warp and one for the second half. Note that no conflict occurs if only one memory location per bank is accessed by the threads of a half-warp.

For devices of CC 2.0, the warp size is 32 threads and the number of banks is also 32. Unlike on CC 1.x devices, a warp's shared memory request is not split, which means bank conflicts can occur between any threads of the same warp, not just within each half-warp.

Calling cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte) increases the bank size to 8 bytes, which helps avoid conflicts when operating on 64-bit data types.

Kepler architecture Shared Memory and L1 Cache

Example 1:

__shared__ float shm[32];
shm[threadIdx.x] = threadIdx.x;

shm[0] falls in the first bank, shm[1] in the second bank, ..., shm[15] in the 16th bank; shm[16] wraps around to the first bank again, which would conflict with thread 0. There is no conflict only because each half-warp's request accesses one distinct location per bank.

Example 2:

__shared__ float shm[48];
int S = 3;
float data = shm[S * threadIdx.x]; // 16 threads in total

The banks accessed by the threads are all staggered, so no conflicts occur.

Example 3:

__shared__ char shm[32];
char data = shm[threadIdx.x];

Conflict! A bank is 32 bits wide, so the four chars accessed by threads 0, 1, 2, and 3 all live in the first bank.

Solution

  1. Memory Padding
  2. Change the memory addressing pattern

Adding padding is simple:

const int TILE_H = 16;
const int TILE_W = 16;
__shared__ int shared[TILE_H][TILE_W];

Add 1 to the width, so it becomes:

__shared__ int shared[TILE_H][TILE_W + 1];

The extra padding elements can simply be set to 0; they are never used in the computation.

Suppose the block size is 16x16, there are 16 banks, and each thread accesses one element per row. In the (omitted) figure, the numbers denote bank indices and the purple cells are the elements accessed by the threads: every thread accesses the same bank at the same time, causing a conflict.

After adding the padding, each row holds one extra element, so the bank layout is shifted by one position relative to the 16x16 tile actually used, avoiding the bank conflict.