Modern supercomputers often use Graphics Processing Units (GPUs) to meet the ever-growing demand for energy-efficient high-performance computing (HPC). GPUs have a complex memory architecture with various types of memories and caches, in particular global memory, shared memory, constant memory, and texture memory. Data placement optimization, i.e., optimizing the placement of data among these different memories, has a significant impact on the performance of HPC applications running on early generations of GPUs. However, newer generations of GPUs implement the same high-level memory hierarchy differently and introduce new memory features.
In this paper, we design a set of experiments to explore the relevance of data placement optimizations on several generations of NVIDIA GPUs, including Kepler, Maxwell, Pascal, and Volta. Our experiments include a set of memory microbenchmarks, CUDA kernels, and a proxy application. The experiments are configured with different CUDA thread block configurations, data input sizes, and data placement choices. The results show that newer generations of GPUs are less sensitive to data placement optimization than older ones, mostly due to improvements to global memory caches.