GPU Profiling & Parallelizing xRAGE at LANL

Description

In June 2023, I began a parallel computing internship at Los Alamos National Laboratory. My partner, Ivan Gonzalez, and I worked primarily with our mentor, Dr. CJ Solomon, and our supervisor, Dr. Shane Fogerty. By the end of the 10-week internship, we had GPU-profiled, parallelized, and validated xRAGE code that XCP-2 will continue to develop!

After GPU-profiling xRAGE, I found that most of the runtime was spent synchronizing memory between the CPU and GPU, inside a method that made expensive Kokkos::parallel_for and Kokkos::parallel_reduce calls. In the end, we successfully implemented two different changes:

  1. Transforming the nested loops into a multi-dimensional range that Kokkos iterates over using an MDRangePolicy.

  2. Applying hierarchical parallelism with teams of threads.

I took the lead on the hierarchical parallelization method, which produced the most improvement:

  • 60x speedup on Darwin

  • 35x speedup on RZAnsel (with multi-process service disabled)

  • 23x speedup on RZAnsel (with multi-process service enabled)

I presented our results to the LANL XCP-2 division and at the Parallel Computing Summer Research Symposium 2023. Lastly, I wrote up our findings and published them in a Los Alamos Unlimited Release report, which you can find here. I’m grateful for the opportunity to improve real computational physics code on numerous high-performance computing systems.