Cache Oblivious Parallelograms in Iterative Stencil Computations

Strzodka, Robert; Shaheen, Mohammed; Pajak, Dawid; Seidel, Hans-Peter

doi:10.1145/1810085.1810096

Please note that a newer version of this record is available:
https://pure.mpg.de/pubman/item/item_1323962_3

Released

Conference Paper

MPG Authors

/persons/resource/persons45566

Strzodka, Robert
Computer Graphics, MPI for Informatics, Max Planck Society;
Graphics - Optics - Vision, MPI for Informatics, Max Planck Society;

/persons/resource/persons45463

Shaheen, Mohammed
Computer Graphics, MPI for Informatics, Max Planck Society;
International Max Planck Research School, MPI for Informatics, Max Planck Society;

/persons/resource/persons45154

Pajak, Dawid
Computer Graphics, MPI for Informatics, Max Planck Society;

/persons/resource/persons45449

Seidel, Hans-Peter
Computer Graphics, MPI for Informatics, Max Planck Society;

External Resources

No external resources are available

Full Texts (restricted access)

No full texts are currently released for your IP range.

Full Texts (open access)

No openly accessible full texts are available in PuRe

Supplementary Material (open access)

No openly accessible supplementary material is available

Citation

Strzodka, R., Shaheen, M., Pajak, D., & Seidel, H.-P. (2010). Cache Oblivious Parallelograms in Iterative Stencil Computations. In ICS'10 (pp. 49-59). New York, NY: ACM. doi:10.1145/1810085.1810096.

Citation link: https://hdl.handle.net/11858/00-001M-0000-000F-1742-0

Abstract

We present a new cache oblivious scheme for iterative stencil computations that performs beyond system bandwidth limitations, as though gigabytes of data could reside in an enormous on-chip cache. We compare execution times for 2D and 3D spatial domains with up to 128 million double precision elements, for constant and variable stencils, against hand-optimized naive code and the automatic polyhedral parallelizer and locality optimizer PluTo, and demonstrate the clear superiority of our results. The performance benefits stem from a tiling structure that caters for data locality, parallelism and vectorization simultaneously. Rather than tiling the iteration space from inside, we take an exterior approach with a predefined hierarchy, simple regular parallelogram tiles and a locality-preserving parallelization. These advantages come at the cost of an irregular workload distribution, but a tightly integrated load balancer ensures a high utilization of all resources.
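
To make the idea of parallelogram tiles concrete, the following is a minimal sketch of a 1D three-point Jacobi stencil traversed in parallelogram-shaped space-time tiles. It is not the authors' scheme: the abstract describes a hierarchical, cache oblivious, parallel method for 2D and 3D domains with a load balancer, none of which appears here. The sketch assumes a single fixed tile width, sequential execution, fixed boundary values and a simple averaging stencil; the function names `jacobi_naive` and `jacobi_parallelogram` and the parameter `width` are hypothetical and chosen only for illustration.

```python
def jacobi_naive(u0, steps):
    """Reference: sweep the whole array once per time step."""
    n = len(u0)
    cur, nxt = list(u0), list(u0)  # boundary cells stay fixed in both buffers
    for _ in range(steps):
        for i in range(1, n - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur, nxt = nxt, cur
    return cur


def jacobi_parallelogram(u0, steps, width):
    """Same computation, traversed in parallelogram space-time tiles.

    Tile k covers, at time step t, the cells [k*width - t, (k+1)*width - t),
    so each tile leans one cell to the left per time step. Processing the
    tiles from left to right respects all stencil dependencies, and each
    tile advances its cells through all time steps before moving on.
    """
    n = len(u0)
    buf = [list(u0), list(u0)]  # buf[t % 2] holds time level t
    # Largest tile index that still intersects the interior at any time step.
    last_tile = (n - 2 + steps - 1) // width
    for k in range(last_tile + 1):
        for t in range(steps):
            src, dst = buf[t % 2], buf[(t + 1) % 2]
            lo = max(1, k * width - t)            # clip to interior cells
            hi = min(n - 1, (k + 1) * width - t)  # exclusive upper bound
            for i in range(lo, hi):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return buf[steps % 2]


if __name__ == "__main__":
    u0 = [float((7 * i) % 11) for i in range(40)]
    # Both traversals perform identical arithmetic per cell, so they agree exactly.
    print(jacobi_parallelogram(u0, 9, 5) == jacobi_naive(u0, 9))
```

Because each cell is updated from the same operands in both traversals, the tiled version reproduces the naive result exactly; only the order of updates changes. In this sketch, a tile touches roughly `width + steps` cells while performing `width * steps` updates, which illustrates how tiling across time steps can reduce memory traffic per update.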