On modern hardware architectures, the performance of Flux Reconstruction
(FR) methods can be limited by memory bandwidth. A typical implementation
structures these methods as a chain of distinct kernels. Often, a
dataset that has just been written to main memory by one kernel is read back
immediately by the next kernel. One way to avoid such a redundant expenditure
of memory bandwidth is kernel fusion. However, on a practical level kernel
fusion requires that the source for all kernels be available, thus preventing calls
to certain third-party library functions. Moreover, it can add substantial complexity
to a codebase. An alternative to full kernel fusion is cache blocking. For
this to be effective, however, the CPU cache has to be sufficiently large.
Historically, the limited size of L1 and L2 caches prevented cache blocking for
high-order CFD applications. In recent years, however, the size of the L2 cache
has grown from around 0.25 MiB to 1.25 MiB, making it possible to apply cache
blocking to high-order CFD codes.
In this approach, kernels remain distinct, and are executed one after another on
small chunks of data that can fit in the cache, as opposed to on full datasets.
These chunks of data stay in the cache, and whenever a kernel requests access
to data that is already in the cache, memory bandwidth is conserved. In this
study, a data structure that facilitates cache blocking is considered, and a range
of kernel grouping configurations for an FR-based Euler solver are examined. A
theoretical study is conducted for hexahedral elements with no anti-aliasing at
p = 3 and p = 4 in order to determine the predicted performance of a few kernel
grouping configurations. Then, these candidates are implemented in the PyFR
solver and the performance gains in practice are compared with the theoretical
estimates that range between 2.05x and 2.50x. An inviscid Taylor-Green Vortex
test case is used as a benchmark, and the most performant configuration leads
to a speedup of approximately 2.81x in practice.
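
The difference between running a kernel chain over full datasets and running it over cache-sized chunks can be illustrated with a minimal sketch. It is not PyFR code: the function names, the count of four live working arrays per element, the five conserved variables, and double precision storage are all assumptions made for illustration, and the sketch assumes every kernel is element-local, which need not hold for kernels that couple neighbouring elements.

```python
# Illustrative sketch only; names and defaults are hypothetical, not PyFR's.

def elements_per_block(p, nvars=5, narrays=4,
                       cache_bytes=1_310_720, dtype_bytes=8):
    """Estimate how many hexahedral elements fit in a cache of cache_bytes.

    p           : polynomial order
    nvars       : variables per solution point (assumed 5 for 3D Euler)
    narrays     : working arrays kept live per element (assumed value)
    cache_bytes : cache capacity, default 1.25 MiB
    dtype_bytes : bytes per value, default 8 (double precision)
    """
    npts = (p + 1) ** 3  # solution points in one hexahedron
    bytes_per_elem = npts * nvars * dtype_bytes * narrays
    return max(1, cache_bytes // bytes_per_elem)


def run_unblocked(kernels, data):
    """Each kernel sweeps the full dataset before the next kernel starts."""
    for kernel in kernels:
        data = [kernel(x) for x in data]
    return data


def run_blocked(kernels, data, block):
    """All kernels run on one chunk before moving on to the next chunk.

    The kernels stay distinct; only the order of traversal changes, so a
    chunk written by one kernel can still be cache-resident when the next
    kernel reads it.
    """
    out = []
    for start in range(0, len(data), block):
        chunk = data[start:start + block]
        for kernel in kernels:
            chunk = [kernel(x) for x in chunk]
        out.extend(chunk)
    return out


if __name__ == "__main__":
    kernels = [lambda x: x + 1.0, lambda x: 2.0 * x]  # stand-in kernels
    data = [float(i) for i in range(1000)]
    block = elements_per_block(3)
    assert run_blocked(kernels, data, block) == run_unblocked(kernels, data)
    print(block, elements_per_block(4))
```

Under these assumed defaults the helper gives 128 elements per chunk at p = 3 and 65 at p = 4. Both traversal orders produce identical results; the blocked version only changes when each piece of data is touched, which is what allows reuse from cache.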