Efficient management of CUDA memory is a fundamental requirement in video applications, where large frame buffers, high memory bandwidth, and low latency are critical. The performance of CUDA-based video pipelines depends heavily on how memory is allocated, accessed, and transferred between the host and the GPU. This includes leveraging pinned host memory for faster transfers, using zero-copy buffering to minimize overhead, and organizing processing tasks across multiple CUDA streams to maximize concurrency. Proper cleanup and synchronization further ensure stability and resource efficiency in demanding real-time video workloads.

Types of CUDA Memory

Global Memory
Global memory is the primary memory space accessible by all threads on the GPU. It is large in capacity but has high access latency compared to other memory types. In video applications, global memory is typically used to store full-resolution frames, inference input/output tensors, and batch buffers. Efficient access patterns (e.g., coalesced reads and writes) are important for maximizing throughput.

Pinned (Page-Locked) Host Memory
Pinned, or page-locked, host memory is allocated on the CPU and locked so it cannot be paged out by the operating system. This enables direct memory access (DMA) transfers between host and device, significantly improving host-to-device (H2D) and device-to-host (D2H) transfer speeds. Pinned memory is required for asynchronous CUDA memory copies (e.g., cudaMemcpyAsync) and is critical for real-time video streaming and low-latency applications. However, excessive use of pinned memory can reduce overall system performance, as it limits the pageable memory available to the OS.

Unified Memory
Unified memory provides a single address space shared between the CPU and the GPU. It simplifies programming by automatically migrating data between the host and the device as needed. However, for high-throughput video workloads, unified memory can introduce unpredictable latency due to page migration, which may significantly reduce frame rates and increase GPU utilization. It is generally not recommended for performance-critical video pipelines.

Device Memory
Device memory is allocated directly on the GPU using cudaMalloc. All CUDA kernels and most DeepStream plugins operate on device memory for maximum efficiency. Device memory is used for intermediate frame buffers, inference tensors, and results. Proper management is crucial to avoid fragmentation and leaks, especially in long-running or multi-stream applications.
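For reference, here is a minimal sketch of allocating a pitched device buffer for a single NV12 frame. The 1080p geometry, function name, and error handling are illustrative assumptions; the point is simply that pitched allocation pairs naturally with the cudaMemcpy2D() copies shown later in this article.

#include <cstdint>
#include <cuda_runtime.h>

// Sketch: allocate a pitched device buffer for one 1920x1080 NV12 frame
// (1.5 bytes per pixel: a full-height luma plane plus a half-height chroma plane).
uint8_t* allocate_nv12_frame(size_t& pitch) {
    const size_t width_bytes = 1920;          // bytes per row of the luma plane
    const size_t nv12_rows   = 1080 * 3 / 2;  // luma rows + interleaved chroma rows

    uint8_t* d_frame = nullptr;
    // cudaMallocPitch pads each row for aligned, coalesced access and returns
    // the actual row stride in 'pitch'; pass this pitch to cudaMemcpy2D later.
    if (cudaMallocPitch((void**)&d_frame, &pitch, width_bytes, nv12_rows) != cudaSuccess) {
        return nullptr;
    }
    return d_frame;
}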
Frame Buffer Allocation Strategy

Static Allocation
Static allocation involves pre-allocating memory buffers for the maximum expected batch size and frame resolution at application startup. This avoids runtime cudaMalloc calls, which are expensive and can cause fragmentation or latency spikes. For example:

cudaMalloc(&frame_buffer, width * height * channels);

Explanation: For NV12 or P010 frames, budget 1.5 or 2 bytes per pixel, respectively. For multiple frames, allocate a pool indexed by frame count or stream ID.

Memory Reuse and Buffer Pooling
Frequent allocation and deallocation of device buffers inside processing loops can lead to fragmentation and performance degradation. Instead, allocate a fixed pool of frame buffers at startup and reuse them across frames. For example:

std::vector<uint8_t*> frame_pool;
for (int i = 0; i < max_frames; ++i) {
    uint8_t* d_buf;
    cudaMalloc(&d_buf, frame_size);
    frame_pool.push_back(d_buf);
}

Explanation: Track buffer usage via flags or circular indexing. This prevents GPU heap fragmentation and allocation overhead.

Zero-Copy Buffering
Zero-copy allows CUDA kernels to access host memory directly, without explicit host-to-device copies. This is achieved by mapping pinned host memory into the device address space, or by using EGL images for interoperability between CUDA and other APIs (such as OpenGL or video decoders).

cudaHostAlloc(&host_buf, size, cudaHostAllocMapped);
cudaHostGetDevicePointer(&dev_ptr, host_buf, 0);

Explanation: host_buf is page-locked system RAM, and dev_ptr is a device-accessible alias to the same memory. This is suitable for use cases such as live camera ingest or direct socket capture into GPU-accessible memory. Bandwidth is lower than device memory, but the explicit copy overhead is eliminated.

Copying Data to/from GPU
Efficient data transfer between the host and the device is crucial in video applications. Use cudaMemcpy2D() or cudaMemcpy2DAsync() to move frame data, as these functions handle pitch (row alignment) and allow for asynchronous operation.

Synchronous Copy
Synchronous copies block the CPU thread until the entire transfer is complete. These copies occur in the default stream and ensure that all previously issued CUDA work (on all streams) is completed before and after the copy.

cudaMemcpy2D(d_frame, pitch, h_frame, width, width, height, cudaMemcpyHostToDevice);

Explanation: Use synchronous copies for simple or one-off transfers. All previously issued GPU work must complete before the copy, and the copy must complete before subsequent work begins.

Asynchronous Copy
Asynchronous copies allow memory transfers to proceed in parallel with kernel execution, provided they occur on different CUDA streams. This is critical for maximizing throughput in pipelined video processing, where decoding, filtering, and encoding happen simultaneously.

cudaMemcpy2DAsync(d_frame, pitch, h_frame, width, width, height, cudaMemcpyHostToDevice, stream);

Explanation: Use asynchronous copies to overlap data transfer and computation. Ensure that h_frame is allocated with cudaHostAlloc() or registered with cudaHostRegister() so it is compatible with asynchronous transfers.

Using Page-Locked (Pinned) Host Memory
Pinned memory ensures that the host buffer is not paged out and enables DMA transfers to and from the GPU.

Allocation Example

uint8_t* h_pinned;
cudaHostAlloc(&h_pinned, buffer_size, cudaHostAllocDefault);

Explanation: cudaHostAllocDefault allocates pinned memory suitable for both synchronous and asynchronous access. Other flags include cudaHostAllocMapped for zero-copy and cudaHostAllocWriteCombined for write-optimized memory.

Stream-Optimized Frame Processing
To achieve full pipeline throughput, allocate multiple buffers and process them using separate CUDA streams. For example:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Upload the decoded frame in stream1
cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, stream1);

// Process in stream2
my_kernel<<<grid, block, 0, stream2>>>(d_input, d_output);

Explanation: Assign one stream per stage (decode, process, encode) to decouple frame operations, and synchronize streams only at critical points, for example before a kernel in one stream consumes data copied on another, or at encoder handoff. A slightly fuller sketch of this pattern follows below.
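To show how the dependency between the copy stream and the compute stream can be expressed without a full synchronization, here is a hedged sketch that uses a CUDA event so the processing stream waits only for the upload it depends on. The kernel body, buffer sizes, and launch dimensions are placeholders and not part of the original example.

#include <cstdint>
#include <cuda_runtime.h>

// Placeholder kernel standing in for a real video-processing stage.
__global__ void my_kernel(const uint8_t* in, uint8_t* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // e.g. a trivial copy/filter step
}

void process_frame(const uint8_t* h_input,   // pinned host frame (cudaHostAlloc)
                   uint8_t* d_input, uint8_t* d_output, size_t size,
                   cudaStream_t copy_stream, cudaStream_t compute_stream) {
    cudaEvent_t upload_done;
    cudaEventCreateWithFlags(&upload_done, cudaEventDisableTiming);

    // Stage 1: asynchronous H2D upload on the copy stream (requires pinned h_input).
    cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(upload_done, copy_stream);

    // Stage 2: the compute stream waits only for this frame's upload, so uploads
    // of later frames on the copy stream can overlap with this kernel.
    cudaStreamWaitEvent(compute_stream, upload_done, 0);
    my_kernel<<<(size + 255) / 256, 256, 0, compute_stream>>>(d_input, d_output, size);

    cudaEventDestroy(upload_done);  // safe: released once the event completes
}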
Memory Cleanup
At application shutdown, it is essential to free all allocated resources to prevent memory leaks and ensure a clean exit.

cudaFree(d_frame);           // device memory from cudaMalloc
cudaFreeHost(h_pinned);      // pinned host memory from cudaHostAlloc
cudaHostUnregister(h_buf);   // host memory pinned with cudaHostRegister
cudaStreamDestroy(stream1);

Explanation: Always synchronize streams before freeing the memory they use. For buffer pools, free every cudaMalloc pointer stored in the pool vector, as sketched below. cudaDeviceReset() can be called at application exit to release all remaining allocations, but it should be reserved for debugging and testing.
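A minimal sketch of that shutdown sequence for the frame pool from the buffer-pooling section is shown below. The function name and the two-stream setup are assumptions carried over from the earlier snippets; the ordering (synchronize, free, destroy streams) is the point being illustrated.

#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

void shutdown_pipeline(std::vector<uint8_t*>& frame_pool,
                       cudaStream_t stream1, cudaStream_t stream2) {
    // Make sure no in-flight copies or kernels still reference the buffers.
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    // Free every device buffer that was cudaMalloc'ed into the pool.
    for (uint8_t* d_buf : frame_pool) {
        cudaFree(d_buf);
    }
    frame_pool.clear();

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);

    // Optional, debug/test only: release everything still held by the context.
    // cudaDeviceReset();
}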