
Using Managed Memory in Any Fortran Compiler in OpenMP and OpenACC


Pros of Managed Memory

Managed memory is a very useful tool when porting a code to GPUs. Some codes have hundreds of kernels and hundreds of variables to manage, and tracking data dependencies across them can be both burdensome and error prone. In many cases, the CUDA runtime will give faster simulations with managed memory than with explicit data statements, if only because it is hard for a human to track in, out, and inout dependencies over that many kernels and variables. The CUDA runtime automatically minimizes the amount of data movement for you, and it doesn't introduce the bugs that hand-written data statements can.

Limitations of Managed Memory

Managed memory will not give good performance in two cases:

  1. If you have intermittent CPU work that interrupts GPU kernels and touches the same data those kernels use, then that shuffling of pages back and forth will be slower than if you had put in data statements yourself.
  2. If you do not prefetch your data, then it is not paged to the GPU until it is first touched in a GPU kernel. What this ultimately looks like in an nvprof profile is absurdly large kernel runtimes due to the latency of that fetch. The profiler will not tell you that it's due to Managed Memory fetches, because it doesn't have that information available; it just looks like kernels taking much longer than they should. The solution is to use cudaMemPrefetchAsync(...) (see the sketch after this list).
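
As a rough sketch of the prefetch call (the helper name and the device number 0 are assumptions for illustration), prefetching a managed allocation to the host before a CPU phase and back to the GPU before the next kernel looks something like this:

#include <cuda_runtime.h>

// Sketch only: "ptr" and "bytes" stand in for any managed allocation in your code
void prefetch_around_cpu_work( void* ptr , size_t bytes ) {
  // Pull the pages to the host before a CPU phase touches them
  cudaMemPrefetchAsync( ptr , bytes , cudaCpuDeviceId , 0 );
  cudaDeviceSynchronize();
  // ... CPU work on the data ...
  // Push the pages back to GPU 0 before the next kernel launch, so the kernel
  // doesn't pay page-fault latency (which shows up as inflated kernel times)
  cudaMemPrefetchAsync( ptr , bytes , 0 , 0 );
  cudaDeviceSynchronize();
}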

So long as your work stays continuously on the GPU (no CPU interruptions touching data used in GPU kernels), and you prefetch your data upon allocation with cudaMallocManaged(), there is no performance degradation with CUDA Managed Memory. In fact, for larger, complex codes, if these two criteria are met, you may notice it doing a better job than you did with your explicit data statements.

OK, I'm sold. How do I use this magical secret sauce?

I'm glad you asked! It turns out you can, in fact, use this with GNU OpenACC and OpenMP offload, PGI OpenACC, and XL OpenMP offload. To start with, you need to stop using automatic Fortran arrays and turn them into allocatable arrays that you allocate and deallocate explicitly. This has the benefit that valgrind can now check that memory for incorrect accesses, but it's also the only way that data can be placed in CUDA Managed Memory. The only data that should remain in Fortran automatic arrays is small arrays that are not dimensioned by the sizes of any of the loops; essentially, only "private" arrays should stay automatic in Fortran.
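
For instance, a subroutine-local work array changes from an automatic array to an explicitly allocated one along these lines (the names tmp, work_before, and work_after are placeholders, not from any real code):

! Sketch: before, an automatic array sized on entry to the subroutine
subroutine work_before( nx , ny )
  integer, intent(in) :: nx, ny
  real :: tmp(nx,ny)          ! automatic array: cannot live in managed memory
  ! ... use tmp ...
end subroutine work_before

! Sketch: after, an allocatable array with explicit allocate/deallocate
subroutine work_after( nx , ny )
  integer, intent(in) :: nx, ny
  real, allocatable :: tmp(:,:)
  allocate( tmp(nx,ny) )      ! explicit allocation (later replaced by gator_allocate)
  ! ... use tmp ...
  deallocate( tmp )
end subroutine work_after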

Next, you'll need to change these arrays from allocatable to pointer, contiguous. The reason is that you're going to allocate them in C++ instead, because Fortran has no way to use CUDA Managed Memory unless you explicitly use CUDA Fortran, which has no path forward on future machines. To allocate in C++ and use the memory in Fortran, you'll have to use the c_f_pointer() routine from iso_c_binding to turn the allocation into a Fortran pointer. The c_f_pointer() routine basically attaches the necessary Fortran array descriptor to make that allocation usable in Fortran. The point of the contiguous specifier is to help the compiler optimize use of the pointer by declaring that it cannot point into strided memory (which Fortran quite unfortunately allows).
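
Concretely, the declaration changes along these lines (tmp is again just a placeholder name):

! Before: Fortran owns the allocation
real, allocatable :: tmp(:,:)

! After: C++ owns the allocation; Fortran just points at it
real, pointer, contiguous :: tmp(:,:)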

Next, you're going to have to replace allocate( var( 0:nx+1 , 0:ny+1 ) ) with a call to a Fortran interface to a C++ routine, such as call gator_allocate( var , [nx+2,ny+2] , [0,0] ) (the dimension sizes followed by the lower bounds, matching the interface below), assuming we're using the gator pool allocator with managed memory. What it's doing under the hood is:

module gator_mod
  use iso_c_binding
  [...]
  interface
    function gator_allocate_c( bytes ) result(ptr) bind(C, name="gatorAllocate")
      use iso_c_binding
      type(c_ptr)              :: ptr
      integer(c_size_t), value :: bytes
    end function gator_allocate_c
  end interface
  [...]
  interface gator_allocate
    [...]
    module procedure :: gator_allocate_real4_2d
    [...]
   end interface gator_allocate
contains
  [...]
  subroutine gator_allocate_real4_2d( arr , dims , lbounds_in )
    integer, parameter :: ndims = 2
    real   , pointer , intent(  out) :: arr       (:,:)
    integer          , intent(in   ) :: dims      (ndims)
    integer, optional, intent(in   ) :: lbounds_in(ndims)
    integer :: lbounds(ndims)
    type(c_ptr) :: data_ptr
    if (present(lbounds_in)) then
      lbounds = lbounds_in
    else
      lbounds = 1
    endif
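    ! r4 is assumed to be a 4-byte real declared elsewhere in the module,
    ! so sizeof(r4) gives the bytes per element (a common compiler extension)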
    data_ptr = gator_allocate_c( int(product(dims)*sizeof(r4),c_size_t) )
    call c_f_pointer( data_ptr , arr , dims )
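    ! adjust the lower bounds via Fortran 2008 pointer bounds remapping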
    arr(lbounds(1):,lbounds(2):) => arr
  end subroutine gator_allocate_real4_2d
  [...]
end module gator_mod
extern "C" void* gatorAllocate( size_t bytes ) {
  void* ptr;
  cudaMallocManaged(&ptr,bytes);
  cudaMemPrefetchAsync(ptr,bytes,0);
  cudaDeviceSynchronize();
  #ifdef _OPENMP45
    omp_target_associate_ptr(ptr,ptr,bytes,0,0);
  #endif
  #ifdef _OPENACC
    acc_map_data(ptr,ptr,bytes);
  #endif
  return ptr;
}

What this does is allocate the memory pool using cudaMallocManaged, then prefetch the entire pool to the GPU with cudaMemPrefetchAsync. Then the real magic happens. With only those two calls, the OpenMP and OpenACC runtimes have no idea that the CPU pointers being used in GPU kernels refer to Managed Memory. Consequently, the XL, PGI, and Cray runtimes will create their own GPU copies and completely ignore that you are using Managed Memory (thus, really bad performance). To fix this, we use omp_target_associate_ptr in OpenMP offload and acc_map_data in OpenACC to trick those runtimes into thinking that the CPU pointer is the GPU pointer. The runtime then doesn't do anything with the data because it thinks the data is already present on the GPU (which, in fact, it truly is, just through CUDA Managed Memory rather than the OpenMP and OpenACC interfaces).
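
The deallocation side isn't shown above, but it needs to undo the association before the memory is freed. A minimal sketch, assuming the same preprocessor guards as above and a hypothetical gatorDeallocate entry point:

#include <cuda_runtime.h>
#ifdef _OPENMP45
  #include <omp.h>
#endif
#ifdef _OPENACC
  #include <openacc.h>
#endif

extern "C" void gatorDeallocate( void* ptr ) {
  #ifdef _OPENMP45
    // Remove the host-to-"device" association on device 0 before the memory goes away
    omp_target_disassociate_ptr(ptr,0);
  #endif
  #ifdef _OPENACC
    acc_unmap_data(ptr);
  #endif
  cudaFree(ptr);
}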

Fortran Pointer Considerations

Not all arrays are created equal in Fortran. Different forms of Fortran arrays carry different amounts of metadata (array descriptors), compilers handle that metadata differently, and there is nothing in the Fortran standard that constrains how they do it. Pointers carry the most metadata, contiguous pointers slightly less, assumed-shape arrays even less, and explicitly sized and automatic arrays the least. When you use a pointer in a GPU kernel, even though the data it points to is already on the GPU, current implementations have no way to keep its descriptor resident there, so the descriptor has to be copied to the GPU at launch, which adds a fair amount of time to smaller kernels.

The problem is that when you use your fancy new pointer, contiguous arrays in kernels, in nvprof you'll notice small amounts of data being copied to the GPU before each kernel (often multiple copies for each pointer used). Those are the array descriptors being copied by the OpenMP runtime with cudaMemcpy for you under the hood. This isn't a problem for very large kernels, but rarely do we have that luxury in climate codes, where we strong scale to the point that kernels run as little as 1-10 microseconds.

To get rid of these array descriptor copies, you have to pass the pointers through a function or subroutine interface and accept them as normal Fortran arrays. For most compilers, accepting a pointer as an assumed-shape array (i.e., arrayname(:,:,:)) will get rid of these extraneous copies. However, some compilers require an explicitly sized array to get rid of all copies to the GPU.
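
As a sketch of that pattern (the module, routine, and variable names here are made up for illustration, and the kernel is shown with OpenACC), the kernel lives in a routine that receives the pointer as an ordinary assumed-shape array:

module kernels_mod
  implicit none
contains
  ! The caller passes a (pointer, contiguous) array; inside, it is just an
  ! ordinary assumed-shape array, so no pointer descriptor travels with it
  subroutine add_one( q , nx , ny )
    integer, intent(in   ) :: nx, ny
    real   , intent(inout) :: q(:,:)
    ! (some compilers need an explicitly sized dummy instead: real, intent(inout) :: q(nx,ny))
    integer :: i, j
    !$acc parallel loop collapse(2)
    do j = 1 , ny
      do i = 1 , nx
        q(i,j) = q(i,j) + 1
      enddo
    enddo
  end subroutine add_one
end module kernels_mod

Calling add_one( var , nx , ny ) with a pointer, contiguous array var then runs the loop on the GPU without re-copying the pointer's descriptor at every launch on most compilers.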

Summary

  • Change automatic arrays to pointer, contiguous arrays, and explicitly allocate and deallocate them
  • Change allocate() and deallocate() to Fortran bindings to C++ routines that do the allocations and deallocations for you
  • In the C++ allocation, use cudaMallocManaged() and cudaMemPrefetchAsync()
  • In OpenMP, also use omp_target_associate_ptr()
  • In OpenACC, also use acc_map_data()
  • cudaMallocManaged is very expensive, so ideally you should use a pool allocator, which pays this cost once for a large block of memory and then hands out chunks of that block at a much lower cost.

As an example of a simple code that does this, see Pool Alligator.

PGI: Note that PGI can enable this behavior with -ta=nvidia,managed, which also turns on its own pool allocator. If you want to use another compiler, though, at the moment you'll need a custom C++ implementation with Fortran bindings like the one above.