Using CUDA for Real-time Video Effects & Filters

>>(d_rgb_frame, width, height);

Use 16x16 threads per block: 256 threads is a multiple of the 32-thread warp size, so no warp is left partially filled.
For 4K and larger frames, the grid must grow with the frame dimensions (one thread per pixel) so that the whole frame is covered.

Box Blur Filter (3x3 Kernel)

Box blur computes the average of each pixel’s 3×3 neighborhood. It is a separable convolution, but it is implemented here as a naïve full kernel for clarity.

CUDA Kernel

    #include <cstdint>

    __global__ void box_blur(const uint8_t* input, uint8_t* output, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // Skip out-of-range threads and the one-pixel frame border;
        // border pixels in output are left unwritten by this kernel.
        if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1) return;

        int idx = (y * width + x) * 3;  // packed RGB, 3 bytes per pixel
        for (int c = 0; c < 3; ++c) {
            int sum = 0;
            // Accumulate the 3x3 neighborhood of channel c.
            for (int j = -1; j <= 1; ++j)
                for (int i = -1; i <= 1; ++i)
                    sum += input[((y + j) * width + (x + i)) * 3 + c];
            output[idx + c] = sum / 9;  // integer average
        }
    }

Replace with a separable filter (a horizontal pass followed by a vertical pass) for better performance; for a 3x3 box this cuts the reads per output value from 9 to 6.
Use shared memory to load each block's tile once, so that neighboring threads do not re-read the same pixels from global memory.

Gamma Correction

Applies a non-linear intensity adjustment to every channel of every pixel: each value is normalized to [0, 1], raised to the power gamma, and scaled back to [0, 255]. This affects brightness and contrast in a perceptually meaningful way.

CUDA Kernel

    __global__ void gamma_correct(uint8_t* frame, float gamma, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int idx = (y * width + x) * 3;  // packed RGB, 3 bytes per pixel
        for (int c = 0; c < 3; ++c) {
            float normalized = frame[idx + c] / 255.0f;  // map [0, 255] to [0, 1]
            normalized = powf(normalized, gamma);
            // Clamp to [0, 1], rescale, and round to nearest instead of truncating.
            frame[idx + c] = static_cast<uint8_t>(__saturatef(normalized) * 255.0f + 0.5f);
        }
    }

Use __saturatef() to clamp values to the [0, 1] range.
Optionally precompute a lookup table (LUT) for fixed gamma values to improve performance; an 8-bit channel has only 256 possible input values.

CUDA Streams for Async Processing

Use CUDA streams to overlap the execution of multiple filters, or to overlap memory transfers with kernel execution. This is crucial for pipelining video frames in real time.
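    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Launch kernels in a stream. grid and block are assumed dim3 launch
    // dimensions (for example, 16x16 threads per block); the fourth launch
    // parameter selects the stream.
    invert_color<<<grid, block, 0, stream>>>(d_rgb_frame, width, height);
    box_blur<<<grid, block, 0, stream>>>(d_input, d_output, width, height);

    // Overlap memory transfers (if needed)
    cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);  // Wait for completion
    cudaStreamDestroy(stream);

Work queued in a single stream runs in issue order, so overlap only happens between operations in different streams, and cudaMemcpyAsync only overlaps with kernels when the host buffer is pinned (for example, allocated with cudaMallocHost).
Multiple streams can be used for multi-stage frame processing. Stream priority and concurrency tuning may be necessary for multi-frame batching.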
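
As a rough illustration of using more than one stream, the sketch below alternates frames between two streams so that one frame's transfers can overlap with another frame's kernel. Everything in it is an assumption made for illustration: `process_frames` is a hypothetical driver, `invert_color_sketch` is a stand-in for the `invert_color` kernel (assumed here to invert packed RGB in place), and error handling is reduced to a single check at the end.

```cuda
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Stand-in kernel (assumption): in-place inversion of packed 8-bit RGB.
__global__ void invert_color_sketch(uint8_t* frame, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int idx = (y * width + x) * 3;
    for (int c = 0; c < 3; ++c) frame[idx + c] = 255 - frame[idx + c];
}

// Hypothetical driver: processes num_frames packed-RGB frames stored back to
// back in h_frames, alternating between two streams. Results overwrite the
// input frames. Returns the most recent CUDA error, or cudaSuccess.
cudaError_t process_frames(uint8_t* h_frames, int num_frames, int width, int height) {
    assert(h_frames != nullptr && num_frames > 0 && width > 0 && height > 0);
    const size_t frame_bytes = static_cast<size_t>(width) * height * 3;
    const dim3 block(16, 16);
    const dim3 grid((width + block.x - 1) / block.x,
                    (height + block.y - 1) / block.y);

    const int kStreams = 2;  // arbitrary starting point
    cudaStream_t streams[kStreams] = {};
    uint8_t* d_frame[kStreams] = {};
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_frame[s], frame_bytes);  // one device buffer per stream
    }

    for (int f = 0; f < num_frames; ++f) {
        const int s = f % kStreams;
        uint8_t* h_frame = h_frames + f * frame_bytes;
        // Upload, filter, download: ordered within stream s, but free to
        // overlap with the work queued in the other stream.
        cudaMemcpyAsync(d_frame[s], h_frame, frame_bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        invert_color_sketch<<<grid, block, 0, streams[s]>>>(d_frame[s], width, height);
        cudaMemcpyAsync(h_frame, d_frame[s], frame_bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < kStreams; ++s) cudaStreamSynchronize(streams[s]);
    const cudaError_t err = cudaGetLastError();  // most recent error, if any

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(d_frame[s]);
        cudaStreamDestroy(streams[s]);
    }
    return err;
}
```

For the copies to actually overlap with kernels, `h_frames` should point to pinned memory (for example from `cudaMallocHost`); with ordinary pageable memory the result is the same but the transfers are not fully asynchronous. Reusing one device buffer per stream is safe here because operations within a stream run in order.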

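The LUT suggestion from the gamma correction section could look roughly like the following. This is a minimal sketch, not a tuned implementation: the names `d_gamma_lut`, `build_gamma_lut`, `upload_gamma_lut`, and `gamma_correct_lut` are hypothetical, and placing the table in constant memory is one assumed choice among several.

```cuda
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cuda_runtime.h>

// 256-entry table (assumed to live in constant memory): one output value
// for each possible 8-bit input value.
__constant__ uint8_t d_gamma_lut[256];

// Host side: fill lut[i] with 255 * (i / 255)^gamma, rounded to nearest.
void build_gamma_lut(float gamma, uint8_t lut[256]) {
    assert(gamma > 0.0f);
    for (int i = 0; i < 256; ++i) {
        const float v = std::pow(i / 255.0f, gamma);  // stays within [0, 1]
        lut[i] = static_cast<uint8_t>(v * 255.0f + 0.5f);
    }
}

// Build the table and copy it to the device. Intended to be called when
// gamma changes, not once per frame.
cudaError_t upload_gamma_lut(float gamma) {
    uint8_t lut[256];
    build_gamma_lut(gamma, lut);
    return cudaMemcpyToSymbol(d_gamma_lut, lut, sizeof(lut));
}

// Per-pixel kernel: replaces the powf() call with a table lookup.
__global__ void gamma_correct_lut(uint8_t* frame, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int idx = (y * width + x) * 3;  // packed RGB, 3 bytes per pixel
    for (int c = 0; c < 3; ++c)
        frame[idx + c] = d_gamma_lut[frame[idx + c]];
}
```

Constant memory is fastest when the threads of a warp read the same entry. LUT indices vary from pixel to pixel, so a shared-memory or plain global-memory table may perform better, and it is worth measuring both.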