GPU acceleration of a petascale application for turbulent mixing at high 
Schmidt number using OpenMP 4.5

Clay, M. P.; Buaria, Dhawal; Yeung, P. K.; Gotoh, T.

doi:10.1016/j.cpc.2018.02.020

Clay, M. P., Buaria, D., Yeung, P. K., & Gotoh, T. (2018). GPU acceleration of a petascale application for turbulent mixing at high Schmidt number using OpenMP 4.5. Computer Physics Communications, 228, 100-114. doi:10.1016/j.cpc.2018.02.020.

Item is Released

Basic

Item Permalink: https://hdl.handle.net/21.11116/0000-0001-9509-D

Version Permalink: https://hdl.handle.net/21.11116/0000-000C-FEEB-E

Genre: Journal Article

Creators

Creators:
Clay, M. P., Author
Buaria, Dhawal¹, Author
Yeung, P. K., Author
Gotoh, T., Author

Affiliations:
¹Laboratory for Fluid Dynamics, Pattern Formation and Biocomplexity, Max Planck Institute for Dynamics and Self-Organization, Max Planck Society, ou_2063287

Content

Free keywords: Turbulence; High Schmidt number; Compact finite differences; Asynchronous GPU computing; OpenMP 4.5; Titan (ORNL)

Abstract: This paper reports on the successful implementation of a massively parallel GPU-accelerated algorithm for the direct numerical simulation of turbulent mixing at high Schmidt number. The work stems from a recent development (Comput. Phys. Commun., vol. 219, 2017, 313-328), in which a low-communication algorithm was shown to attain high degrees of scalability on the Cray XE6 architecture when overlapping communication and computation via dedicated communication threads. An even higher level of performance has now been achieved using OpenMP 4.5 on the Cray XK7 architecture, where on each node the 16 integer cores of an AMD Interlagos processor share a single Nvidia K20X GPU accelerator. In the new algorithm, data movements are minimized by performing virtually all of the intensive scalar field computations in the form of combined compact finite difference (CCD) operations on the GPUs. A memory layout in departure from usual practices is found to provide much better performance for a specific kernel required to apply the CCD scheme. Asynchronous execution enabled by adding the OpenMP 4.5 NOWAIT clause to TARGET constructs improves scalability when used to overlap computation on the GPUs with computation and communication on the CPUs. On the 27-petaflops supercomputer Titan at Oak Ridge National Laboratory, USA, a GPU-to-CPU speedup factor of approximately 5 is consistently observed at the largest problem size of 8192³ grid points for the scalar field computed with 8192 XK7 nodes.
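
The asynchronous execution mentioned in the abstract can be illustrated with a minimal sketch in C. This is not the authors' implementation: the function name `overlap_demo`, its parameters, and the trivial scaling kernel are hypothetical stand-ins chosen only to show how a NOWAIT clause on a TARGET construct lets host work proceed while the offloaded region runs. Without an OpenMP-enabled compiler or an attached device, the pragmas are ignored or fall back to the host and the result is the same.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the paper): scales `field` inside a
 * target region launched with `nowait`, sums `host_data` on the CPU
 * while the target task may still be running, then synchronizes. */
double overlap_demo(double *field, size_t n, double factor,
                    const double *host_data, size_t m)
{
    assert(field != NULL && host_data != NULL);
    double host_sum = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        /* `nowait` turns the target region into a deferred task, so the
         * encountering thread does not block until the kernel finishes. */
        #pragma omp target teams distribute parallel for \
            map(tofrom: field[0:n]) nowait
        for (size_t i = 0; i < n; ++i) {
            field[i] *= factor;
        }

        /* Independent host work that can overlap with the offloaded loop
         * (a stand-in for CPU computation or communication). */
        for (size_t j = 0; j < m; ++j) {
            host_sum += host_data[j];
        }

        /* Wait for the deferred target task before `field` is used again. */
        #pragma omp taskwait
    }
    return host_sum;
}
```

The host loop touches only `host_data`, so it has no dependence on the offloaded loop; the `taskwait` is the single point at which the two streams of work are joined.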

Details

Language(s): eng - English

Dates: Published Online: 2018-03-07; Date issued: 2018-07

Publication Status: Issued

Rev. Type: Peer

Identifiers: DOI: 10.1016/j.cpc.2018.02.020

Source 1

Title: Computer Physics Communications

Source Genre: Journal

Pages: -

Volume / Issue: 228

Sequence Number: -

Start / End Page: 100 - 114

Identifier: -