GPU MEMORY IN DX12

To begin, we must consider CPUs with integrated graphics as well as configurations that pair a pure CPU with a discrete GPU. The Ryzen 5 2400G processor, for example, carves the VRAM for its integrated Vega 11 graphics out of system DDR4 memory.

GPUs don’t yet support page faulting, so applications must commit data into physical memory while the GPU could be accessing it. (An APU that draws its VRAM from system memory can still lean on the CPU’s page-fault-backed virtual memory.) In D3D12, most API objects encapsulate some amount of GPU-accessible memory. That GPU-accessible memory is made resident during the creation of the API object, and evicted on API object destruction.
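
For heaps and resources an application manages explicitly, the device also exposes MakeResident and Evict. A minimal sketch, assuming m_heap is an ID3D12Heap the application owns and can afford to demote for a while:

// Demote the heap while it is idle, then restore it before the
// GPU references its contents again.
ID3D12Pageable* pageables[] = { m_heap.Get() };
ThrowIfFailed(m_d3d12Device->Evict(1, pageables));
// ... later, before recording commands that touch the heap ...
ThrowIfFailed(m_d3d12Device->MakeResident(1, pageables));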

DX12 applications are encouraged to use a reservation to denote the amount of memory they cannot go without. Ideally, the user-specified “low” graphics settings, or something even lower, is the right value for such a reservation. Setting a reservation won’t ever give an application a higher budget than it would normally receive. Instead, the reservation information helps the OS kernel quickly minimize the impact of large memory pressure situations. Even the reservation is not guaranteed to be available to the application when the application isn’t the foreground application.
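
DXGI exposes both the current budget and the reservation hook. A minimal sketch, assuming m_adapter is the IDXGIAdapter3 the device was created on and lowSettingsFootprint is a hypothetical estimate of the “low” settings working set:

DXGI_QUERY_VIDEO_MEMORY_INFO info = {};
ThrowIfFailed(m_adapter->QueryVideoMemoryInfo(
    0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, &info));
// Never ask for more than the OS reports as reservable.
UINT64 reservation = lowSettingsFootprint;
if (reservation > info.AvailableForReservation)
    reservation = info.AvailableForReservation;
ThrowIfFailed(m_adapter->SetVideoMemoryReservation(
    0, DXGI_MEMORY_SEGMENT_GROUP_LOCAL, reservation));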

Modern graphics cards support tiled (reserved) resources at some tier; the minimum, tier 1, is adequate for low graphics settings. When available, reserved resources offer the most advanced residency management techniques, but not all adapters currently support them. They enable remapping a resource without regenerating resource descriptors, partial mip level residency, sparse texture scenarios, and so on. Not all resource types are supported even when reserved resources are available, so a fully general page-based residency manager isn’t yet feasible.
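
The supported tier can be queried up front through CheckFeatureSupport. A small sketch, assuming m_d3d12Device already exists:

D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
ThrowIfFailed(m_d3d12Device->CheckFeatureSupport(
    D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options)));
if (options.TiledResourcesTier >= D3D12_TILED_RESOURCES_TIER_2) {
    // Tier 2 adds guarantees such as reads from unmapped tiles
    // returning zero, so partial mip residency is practical here.
}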

Video card vendors’ marketing of feature levels has really complicated the design of a game engine. The Windows 10 Creators Update enables developers to influence which heaps and resources will be preferred to stay resident when memory pressure requires that some of an application’s resources be demoted. This helps developers create better-performing applications by leveraging knowledge that the runtime can’t infer from API usage. It’s expected that developers will become more comfortable and capable specifying priorities as they transition from committed resources to reserved and tiled resources.
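
That influence is exposed through ID3D12Device1::SetResidencyPriority. A minimal sketch, again assuming m_heap holds assets worth protecting under memory pressure:

ComPtr<ID3D12Device1> device1;
ThrowIfFailed(m_d3d12Device.As(&device1));
ID3D12Pageable* pageables[] = { m_heap.Get() };
D3D12_RESIDENCY_PRIORITY priorities[] = { D3D12_RESIDENCY_PRIORITY_HIGH };
ThrowIfFailed(device1->SetResidencyPriority(1, pageables, priorities));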

Both Unreal Engine 4 and CRYENGINE support DX11 and DX12, and their respective development teams spend a great deal of time with the DX11 and DX12 documentation, most of which lives on MSDN.

// Create an 11 device wrapped around the 12 device and share
// 12's command queue.
ComPtr<ID3D11Device> d3d11Device;
ThrowIfFailed(D3D11On12CreateDevice(
    m_d3d12Device.Get(),
    d3d11DeviceFlags,
    nullptr,
    0,
    reinterpret_cast<IUnknown**>(m_commandQueue.GetAddressOf()),
    1,
    0,
    &d3d11Device,
    &m_d3d11DeviceContext,
    nullptr
));

// Query the 11On12 device from the 11 device.
ThrowIfFailed(d3d11Device.As(&m_d3d11On12Device));

This code fragment shows how a 2D setup can guard against running on a very old graphics card. This is the idea of a feature level: by wrapping a DX11 device, the application can gracefully abort if feature level 11 is unavailable.

// d3dcommon.h
D3D_FEATURE_LEVEL_1_0_CORE = 0x1000
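
In practice the minimum feature level is passed at device creation, and the call simply fails on hardware that can’t meet it. A sketch along these lines (default adapter; error handling reduced to a comment):

ComPtr<ID3D12Device> device;
HRESULT hr = D3D12CreateDevice(
    nullptr,                   // default adapter
    D3D_FEATURE_LEVEL_11_0,    // the minimum level we can live with
    IID_PPV_ARGS(&device));
if (FAILED(hr)) {
    // Gracefully abort: the adapter predates feature level 11.
}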

Game development is hard work, and a seasoned Win32 developer who has struggled with every release from DX3 onward knows only too well how hard it is to keep ahead.

class Vector {
private:
  void *d_p;
  size_t alloc_sz, reserve_sz;
public:
  Vector() : d_p(NULL), alloc_sz(0), reserve_sz(0) {}
  // Reserves some extra space in order to speed up grow()
  CUresult reserve(size_t new_sz);
  // Actually commits num bytes of additional memory
  CUresult grow(size_t new_sz);
  // Frees up all the associated resources.
  ~Vector();
}; 

This CUDA vector object is able to grow with available resources.

CUresult Vector::reserve(size_t new_sz) {
  if (new_sz > reserve_sz) {
    void *new_ptr = nullptr;
#ifndef USE_MANAGED_MEMORY
    cudaMalloc(&new_ptr, new_sz);
#else
    cudaMallocManaged(&new_ptr, new_sz);
#endif
    // Copy the old contents into the larger allocation, then release it.
    if (d_p) {
      cudaMemcpy(new_ptr, d_p, alloc_sz, cudaMemcpyDefault);
      cudaFree(d_p);
    }
    d_p = new_ptr;
    reserve_sz = new_sz;
  }
  return CUDA_SUCCESS;
}
CUresult Vector::grow(size_t new_sz) {
  Vector::reserve(alloc_sz + new_sz);
#ifdef USE_MANAGED_MEMORY
  // Prefetch just the newly committed range to the current device.
  int dev = 0;
  cudaGetDevice(&dev);
  cudaMemPrefetchAsync((char *)d_p + alloc_sz, new_sz, dev, 0);
#endif
  alloc_sz += new_sz;
  return CUDA_SUCCESS;
}

Vector::~Vector() {
  if (d_p) cudaFree(d_p);
} 
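
A minimal usage sketch (sizes are hypothetical; a valid CUDA context is assumed to be current and error checking is elided):

Vector v;
v.reserve(4 << 20);  // pre-reserve 4 MB so early grows skip the copy
v.grow(1 << 20);     // commit 1 MB; with USE_MANAGED_MEMORY this also prefetches
v.grow(1 << 20);     // still inside the reservation, so no reallocation occurs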

Depending on the available video card memory, this vector may not be able to fully leverage the card’s memory for rendering needs:

  • The cudaMalloc function must allocate more than what is actually needed to grow the allocation. To grow, you need to keep the old allocation and allocate a new one with enough room for the old data plus the extra space, which significantly reduces how much you can grow. If the device only has 2 GB of memory and you already have a 1 GB vector, you can’t grow it any larger, as you would need 1 GB plus however much you need to grow. Effectively, you can’t grow a vector that is larger than half of the GPU memory.
  • Each allocation must be mapped to all peer contexts, even if it is never used in those peer contexts.
  • The cudaMemcpy call adds latency to the growing request and uses precious memory bandwidth to duplicate data. This bandwidth could be better spent elsewhere.
  • The cudaFree call waits for all pending work on the current context (and all the peer contexts as well) before proceeding.

The primary use for D3D11_USAGE_STAGING is as a way to load data into other D3D11_USAGE_DEFAULT pool resources. Another common usage is ‘readback’ of a render target to CPU-accessible memory. You can use CopySubresourceRegion to move data between DEFAULT and STAGING resources (discrete hardware often uses Direct Memory Access to handle the moving of data between system memory and VRAM).
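
A minimal readback sketch, where device, context, renderTarget, and its description renderTargetDesc are assumed to exist already:

// Describe a CPU-readable staging twin of the render target.
D3D11_TEXTURE2D_DESC desc = renderTargetDesc;
desc.Usage = D3D11_USAGE_STAGING;
desc.BindFlags = 0;
desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
desc.MiscFlags = 0;

ComPtr<ID3D11Texture2D> staging;
ThrowIfFailed(device->CreateTexture2D(&desc, nullptr, &staging));

// DMA the render target into the staging resource, then map it.
context->CopyResource(staging.Get(), renderTarget.Get());
D3D11_MAPPED_SUBRESOURCE mapped;
ThrowIfFailed(context->Map(staging.Get(), 0, D3D11_MAP_READ, 0, &mapped));
// ... read mapped.pData row by row using mapped.RowPitch ...
context->Unmap(staging.Get(), 0);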