Efficient management of CUDA memory is a fundamental requirement in video applications, where large frame buffers, high memory bandwidth, and low latency are critical. The performance of CUDA-based video pipelines depends heavily on how memory is allocated, accessed, and transferred between the host and the GPU. This includes leveraging pinned host memory for faster transfers, using zero-copy buffering to minimize overhead, and organizing processing tasks across multiple CUDA streams to maximize concurrency. Proper cleanup and synchronization further ensure stability and resource efficiency in demanding real-time video workloads.

Types of CUDA Memory

Global Memory
Global memory is the primary memory space accessible by all threads on the GPU. It is large in capacity but has high access latency compared to other memory types. In video applications, global memory is typically used to store full-resolution frames, inference input/output tensors, and batch buffers. Efficient access patterns (e.g., coalesced reads and writes) are important for maximizing throughput.

Pinned (Page-Locked) Host Memory
Pinned, or page-locked, host memory is allocated on the CPU and locked so it cannot be paged out by the operating system. This enables direct memory access (DMA) transfers between host and device, significantly improving host-to-device (H2D) and device-to-host (D2H) transfer speeds. Pinned memory is required for asynchronous CUDA memory copies (e.g., cudaMemcpyAsync) and is critical for real-time video streaming and low-latency applications. However, excessive use of pinned memory can reduce overall system performance, as it limits the pageable memory available to the OS.

Unified Memory
Unified memory provides a single address space shared between the CPU and the GPU. It simplifies programming by automatically migrating data between the host and the device as needed. However, for high-throughput video workloads, unified memory can introduce unpredictable latency due to page migration, which may significantly reduce frame rates and increase GPU utilization. It is generally not recommended for performance-critical video pipelines.

Device Memory
Device memory is allocated directly on the GPU using cudaMalloc. All CUDA kernels and most DeepStream plugins operate on device memory for maximum efficiency. Device memory is used for intermediate frame buffers, inference tensors, and results. Proper management is crucial to avoid fragmentation and leaks, especially in long-running or multi-stream applications.
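For reference, here is a minimal sketch of allocating a pitched device buffer for a single NV12 frame. The 1080p geometry, function name, and error handling are illustrative assumptions; the point is simply that pitched allocation pairs naturally with the cudaMemcpy2D() copies shown later in this article.

#include <cstdint>
#include <cuda_runtime.h>

// Sketch: allocate a pitched device buffer for one 1920x1080 NV12 frame
// (1.5 bytes per pixel: a full-height luma plane plus a half-height chroma plane).
uint8_t* allocate_nv12_frame(size_t& pitch) {
    const size_t width_bytes = 1920;          // bytes per row of the luma plane
    const size_t nv12_rows   = 1080 * 3 / 2;  // luma rows + interleaved chroma rows

    uint8_t* d_frame = nullptr;
    // cudaMallocPitch pads each row for aligned, coalesced access and returns
    // the actual row stride in 'pitch'; pass this pitch to cudaMemcpy2D later.
    if (cudaMallocPitch((void**)&d_frame, &pitch, width_bytes, nv12_rows) != cudaSuccess) {
        return nullptr;
    }
    return d_frame;
}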
Frame Buffer Allocation Strategy

Static Allocation
Static allocation involves pre-allocating memory buffers for the maximum expected batch size and frame resolution at application startup. This avoids runtime cudaMalloc calls, which are expensive and can cause fragmentation or latency spikes. For example:

cudaMalloc(&frame_buffer, width * height * channels);

Explanation: For NV12 or P010 frames, budget 1.5 or 2 bytes per pixel, respectively. For multiple frames, allocate a pool indexed by frame count or stream ID.

Memory Reuse and Buffer Pooling
Frequent allocation and deallocation of device buffers inside processing loops can lead to fragmentation and performance degradation. Instead, allocate a fixed pool of frame buffers at startup and reuse them across frames. For example:

std::vector<uint8_t*> frame_pool;
for (int i = 0; i < max_frames; ++i) {
    uint8_t* d_buf;
    cudaMalloc(&d_buf, frame_size);
    frame_pool.push_back(d_buf);
}

Explanation: Track buffer usage via flags or circular indexing. This prevents GPU heap fragmentation and allocation overhead.

Zero-Copy Buffering
Zero-copy allows CUDA kernels to access host memory directly, without explicit host-to-device copies. This is achieved by mapping pinned host memory into the device address space, or by using EGL images for interoperability between CUDA and other APIs (such as OpenGL or video decoders).

cudaHostAlloc(&host_buf, size, cudaHostAllocMapped);
cudaHostGetDevicePointer(&dev_ptr, host_buf, 0);

Explanation: host_buf is page-locked system RAM, and dev_ptr is a device-accessible alias to the same memory. This is suitable for use cases such as live camera ingest or direct socket capture into GPU-accessible memory. Bandwidth is lower than device memory, but the explicit copy overhead is eliminated.

Copying Data to/from GPU
Efficient data transfer between the host and the device is crucial in video applications. Use cudaMemcpy2D() or cudaMemcpy2DAsync() to move frame data, as these functions handle pitch (row alignment) and allow for asynchronous operation.

Synchronous Copy
Synchronous copies block the CPU thread until the entire transfer is complete. These copies occur in the default stream and ensure that all previously issued CUDA work (on all streams) is completed before and after the copy.

cudaMemcpy2D(d_frame, pitch, h_frame, width, width, height, cudaMemcpyHostToDevice);

Explanation: Use synchronous copies for simple or one-off transfers. All previously issued GPU work must complete before the copy, and the copy must complete before subsequent work begins.

Asynchronous Copy
Asynchronous copies allow memory transfers to proceed in parallel with kernel execution, provided they occur on different CUDA streams. This is critical for maximizing throughput in pipelined video processing, where decoding, filtering, and encoding happen simultaneously.

cudaMemcpy2DAsync(d_frame, pitch, h_frame, width, width, height, cudaMemcpyHostToDevice, stream);

Explanation: Use asynchronous copies to overlap data transfer and computation. Ensure that h_frame is allocated with cudaHostAlloc() or registered with cudaHostRegister() so it is compatible with asynchronous transfers.

Using Page-Locked (Pinned) Host Memory
Pinned memory ensures that the host buffer is not paged out and enables DMA transfers to and from the GPU.

Allocation Example

uint8_t* h_pinned;
cudaHostAlloc(&h_pinned, buffer_size, cudaHostAllocDefault);

Explanation: cudaHostAllocDefault allocates pinned memory suitable for both synchronous and asynchronous access. Other flags include cudaHostAllocMapped for zero-copy and cudaHostAllocWriteCombined for write-optimized memory.

Stream-Optimized Frame Processing
To achieve full pipeline throughput, allocate multiple buffers and process them using separate CUDA streams. For example:

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Upload the decoded frame in stream1
cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, stream1);

// Process in stream2
my_kernel<<<grid, block, 0, stream2>>>(d_input, d_output);

Explanation: Assign one stream per stage (decode, process, encode) to decouple frame operations, and synchronize streams only at critical points, for example before a kernel in one stream consumes data copied on another, or at encoder handoff. A slightly fuller sketch of this pattern follows below.
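To show how the dependency between the copy stream and the compute stream can be expressed without a full synchronization, here is a hedged sketch that uses a CUDA event so the processing stream waits only for the upload it depends on. The kernel body, buffer sizes, and launch dimensions are placeholders and not part of the original example.

#include <cstdint>
#include <cuda_runtime.h>

// Placeholder kernel standing in for a real video-processing stage.
__global__ void my_kernel(const uint8_t* in, uint8_t* out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // e.g. a trivial copy/filter step
}

void process_frame(const uint8_t* h_input,   // pinned host frame (cudaHostAlloc)
                   uint8_t* d_input, uint8_t* d_output, size_t size,
                   cudaStream_t copy_stream, cudaStream_t compute_stream) {
    cudaEvent_t upload_done;
    cudaEventCreateWithFlags(&upload_done, cudaEventDisableTiming);

    // Stage 1: asynchronous H2D upload on the copy stream (requires pinned h_input).
    cudaMemcpyAsync(d_input, h_input, size, cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(upload_done, copy_stream);

    // Stage 2: the compute stream waits only for this frame's upload, so uploads
    // of later frames on the copy stream can overlap with this kernel.
    cudaStreamWaitEvent(compute_stream, upload_done, 0);
    my_kernel<<<(size + 255) / 256, 256, 0, compute_stream>>>(d_input, d_output, size);

    cudaEventDestroy(upload_done);  // safe: released once the event completes
}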
Memory Cleanup
At application shutdown, it is essential to free all allocated resources to prevent memory leaks and ensure a clean exit.

cudaFree(d_frame);           // device memory from cudaMalloc
cudaFreeHost(h_pinned);      // pinned host memory from cudaHostAlloc
cudaHostUnregister(h_buf);   // host memory pinned with cudaHostRegister
cudaStreamDestroy(stream1);

Explanation: Always synchronize streams before freeing the memory they use. For buffer pools, free every cudaMalloc pointer stored in the pool vector, as sketched below. cudaDeviceReset() can be called at application exit to release all remaining allocations, but it should be reserved for debugging and testing.
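A minimal sketch of that shutdown sequence for the frame pool from the buffer-pooling section is shown below. The function name and the two-stream setup are assumptions carried over from the earlier snippets; the ordering (synchronize, free, destroy streams) is the point being illustrated.

#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

void shutdown_pipeline(std::vector<uint8_t*>& frame_pool,
                       cudaStream_t stream1, cudaStream_t stream2) {
    // Make sure no in-flight copies or kernels still reference the buffers.
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    // Free every device buffer that was cudaMalloc'ed into the pool.
    for (uint8_t* d_buf : frame_pool) {
        cudaFree(d_buf);
    }
    frame_pool.clear();

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);

    // Optional, debug/test only: release everything still held by the context.
    // cudaDeviceReset();
}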