Publications

BibTeX format

@article{Akkurt:2022:10.1016/j.cpc.2021.108193,
author = {Akkurt, S and Witherden, F and Vincent, P},
doi = {10.1016/j.cpc.2021.108193},
journal = {Computer Physics Communications},
pages = {1--9},
title = {Cache blocking strategies applied to flux reconstruction},
url = {http://dx.doi.org/10.1016/j.cpc.2021.108193},
volume = {271},
year = {2022}
}


RIS format (EndNote, RefMan)

TY - JOUR
AB - On modern hardware architectures, the performance of Flux Reconstruction (FR) methods can be limited by memory bandwidth. In a typical implementation, these methods are implemented as a chain of distinct kernels. Often, a dataset which has just been written in the main memory by a kernel is read back immediately by the next kernel. One way to avoid such a redundant expenditure of memory bandwidth is kernel fusion. However, on a practical level kernel fusion requires that the source for all kernels be available, thus preventing calls to certain third-party library functions. Moreover, it can add substantial complexity to a codebase. An alternative to full kernel fusion is cache blocking. But for this to be effective, CPU cache has to be meaningfully big. Historically, size of L1 and L2 caches prevented cache blocking for high-order CFD applications. However in recent years, size of L2 cache has grown from around 0.25 MiB to 1.25 MiB, and made it possible to apply cache blocking for high-order CFD codes. In this approach, kernels remain distinct, and are executed one after another on small chunks of data that can fit in the cache, as opposed to on full datasets. These chunks of data stay in the cache and whenever a kernel requests access to data that is already in the cache, memory bandwidth is conserved. In this study, a data structure that facilitates cache blocking is considered, and a range of kernel grouping configurations for an FR based Euler solver are examined. A theoretical study is conducted for hexahedral elements with no anti-aliasing at p = 3 and p = 4 in order to determine the predicted performance of a few kernel grouping configurations. Then, these candidates are implemented in the PyFR solver and the performance gains in practice are compared with the theoretical estimates that range between 2.05x and 2.50x. An inviscid Taylor-Green Vortex test case is used as a benchmark, and the most performant configuration leads to a speedup of approximately 2.81x in practice.
AU - Akkurt,S
AU - Witherden,F
AU - Vincent,P
DO - 10.1016/j.cpc.2021.108193
EP - 9
PY - 2022///
SN - 0010-4655
SP - 1
TI - Cache blocking strategies applied to flux reconstruction
T2 - Computer Physics Communications
UR - http://dx.doi.org/10.1016/j.cpc.2021.108193
UR - https://www.sciencedirect.com/science/article/pii/S0010465521003052?via%3Dihub
UR - http://hdl.handle.net/10044/1/99114
VL - 271
ER -


Professor Peter Vincent
