Skip navigation.

Optimizing Memory Access Patterns for Cellular Automata on GPUs

Publication Type:

Book Chapter


GPU Computing Gems, Volume 2 (2011)


A cellular automaton (CA) is a discrete mathematical model used to calculate
the global behavior of a complex system using (ideally) simple local
rules. The space of interest is tessellated into a grid of cells and the behavior
of each cell is captured in state variables whose values at any instant are
functions of a small neighborhood around the cell. The dynamic behavior
of the system is modeled by the evolution of cell states, which are computed
repeatedly for all cells in discrete time steps. CA-based models are highly
parallelizable as, in each time step, the new state is determined completely
by the neighborhood state in the previous time step. When parallelized,
these calculations are generally memory-bound since the number of arithmetic
operations performed per memory access is relatively small (i.e., the
arithmetic intensity is low). In this article, we present a series of progressively
sophisticated techniques for improving the memory access patterns in
CA-based calculations, and hence the overall performance. Along the way,
we utilize some common techniques encountered when developing CUDA
code including: the use of shared memory, memory alignment, minimizing
halo bytes, and ways to increase the computational intensity.