The Clang implementation of OpenMP® 4.5 now
provides full support for the specification, offering the only
open source option for targeting NVIDIA® GPUs. While using
OpenMP allows portability across different architectures, matching
native CUDA® performance without major code restructuring
is an open research issue.
To analyze the current performance, we port a suite of
representative benchmarks and the mature mini-apps TeaLeaf,
CloverLeaf, and SNAP to the Clang OpenMP 4.5 compiler.
We then collect performance results for those ports, and for their
equivalent CUDA ports, on an NVIDIA Kepler GPU. Through
manual analysis of the generated code, we identify
the root causes of the performance differences between OpenMP
and CUDA.
A number of improvements can be made to the existing
compiler implementation to enable performance that approaches
that of hand-optimized CUDA. Our first observation was that
the generated code did not use fused multiply-add instructions,
which we resolved using an existing flag. Next, we found that the
compiler was not passing any loads through the non-coherent cache,
and we added a new compiler flag to address this problem.
We then observed that the compiler's partitioning of threads
and teams could be improved for the majority of kernels,
which guided work to ensure that the compiler picks better
defaults. Finally, we uncovered a register allocation issue in
the existing implementation that, when fixed alongside the other
issues, enables performance that is close to CUDA.
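
As a minimal sketch of the kind of kernel these observations apply to, the following OpenMP 4.5 target region sets the team and thread partitioning explicitly rather than relying on compiler defaults. The function name scaled_add and the values 128 and 256 are hypothetical choices for illustration, not taken from the benchmarks in this work. The loop body is a multiply followed by an add, which is the kind of expression a compiler may contract into a single fused multiply-add.

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * The num_teams and thread_limit values are illustrative assumptions;
 * suitable values depend on the kernel and the target GPU. */
void scaled_add(int n, double a, const double *x, double *y)
{
    /* num_teams and thread_limit override the default partitioning
     * of the iteration space into teams and threads. */
    #pragma omp target teams distribute parallel for \
        num_teams(128) thread_limit(256) \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        /* Multiply-add: a candidate for FMA contraction. */
        y[i] = a * x[i] + y[i];
    }
}
```

When built without OpenMP offloading enabled, the pragma is ignored and the loop runs sequentially on the host, which makes the sketch easy to check for correctness before tuning the partitioning.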
Finally, we use a different set of kernels to emphasize that
support for managing memory hierarchies needs to be introduced
into the specification, and we propose a simple option for
programming shared caches.