14 Commits

Author SHA1 Message Date
gx 636bd78854 vf_cuda_grid: overlay с invalid cell — silent skip вместо AVERROR
Когда controller broadcast'ит overlay command ко всем cuda_grid instances
(quad/single/mpp), overlay с cell=2 (gate_lpr cell в quad/mpp) попадает
в single layout который имеет nb_cells=1 → overlay_pixel_rect возвращал
AVERROR(EINVAL) → render_overlays propagated к cuda_grid_compose →
"Error while filtering: Invalid argument" каждый frame → pipeline crash loop.

Fix: return 1 (positive skip) вместо AVERROR. render_overlay_rect уже
имеет `if (ret != 0) return ret < 0 ? ret : 0;` — positive skip правильно
обрабатывается без error.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-21 07:08:23 +01:00
gx a326ef146c vf_cuda_grid: Phase 6 — process_command 'reload_icon <name>'
Invalidate cached icon atlas by name. Next render автоматически re-reads PNG
file с disk. Used controller'ом для periodic re-render dynamic overlays
(charts/chats) — write new PNG → send reload_icon → filter подхватывает.

Без этого pattern dynamic overlay невозможен — icon overlay cached forever
after first load by icon_name.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 21:48:09 +01:00
gx c5130cb15c vf_cuda_grid: Phase 4b-4 — icon overlay (PNG/JPG decode + sprite blit)
PNG/JPG decode через libavcodec (системный decoder), conversion в RGBA8
через swscale, upload в pitched CUDA buffer, blit через Alpha_Blit_RGBA_Y/UV
(те же kernels что text).

Icon cache — shared by name (одна и та же иконка для нескольких overlays =
один GPU atlas). Lifecycle: load on first use, free в uninit.

Path resolution: <icon_dir>/<name> + try .png/.jpg extensions:
  /var/lib/cuda-grid/icons  (default)
  /opt/cuda-grid/icons      (fallback)
  + abs path supported

Options:
  icon_dir=  — overrides default search paths

Wire format: add_overlay <id> icon x=.. y=.. icon_name=domofon opacity=200

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:41:11 +01:00
gx b88f966f83 vf_cuda_grid: URL-decode для string overlay fields (text, icon_name)
Controller URL-encode'ит string values (text="hello world" → text=hello%20world)
чтобы пройти sscanf("%s") который stops on whitespace.
Filter inline decode'ит '%xx' obratno в bytes. Только для text + icon_name.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:32:38 +01:00
gx 4010461300 vf_cuda_grid: Phase 4b-3 — text overlay (freetype + RGBA atlas)
CPU rasterization через freetype (DejaVu/Liberation auto-detect или
font_file= option), upload pitched RGBA buffer to GPU, blit через
Alpha_Blit_RGBA_Y/UV kernels (Phase 4b-2 уже had).

Cache:
  per-overlay-id atlas с keys (text, font_size, r/g/b) — re-rasterize
  только при change. Cleanup при remove_overlay/clear_overlays/uninit.

Options:
  font_file=  — TTF path (default: search DejaVu/Liberation)
  font_size=  — default size if overlay не указал свой

Wire format: add_overlay <id> text x=.. y=.. text=hello font_size=24 r=255 g=255 b=255 opacity=200

Conditional на CONFIG_LIBFREETYPE — без него text overlays no-op
(остальные типы работают как обычно).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:30:36 +01:00
gx 1e54f04e24 configure: add cuda_grid_filter_deps_any для cuda_nvcc/cuda_llvm
Без этого filter compileд но .ptx step failед: configure не knows что
filter требует nvcc/clang для .cu compile. Same pattern как scale_cuda.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:26:12 +01:00
gx 8ca590004b vf_cuda_grid: Phase 4b-2 — CUDA kernels (alpha-blended overlays)
Custom .cu kernels компилируются через ffbuild → .ptx, embedded в binary
через bin2c pattern (как vf_scale_cuda). Loading через ff_cuda_load_module
в config_output, kernel handles cached на life-of-filter.

Kernels:
  Alpha_Fill_Y/UV          — solid colour α-blend (rect, dim, border strips)
  Alpha_Blit_RGBA_Y/UV     — blit RGBA atlas → NV12 (text/icon в 4b-3/4)

Render side:
  render_strip_alpha       — заменил render_strip_solid (alpha=255 ≡ solid)
  render_overlay_rect      — opacity передаётся в kernel
  render_overlay_dim       — Y=16/UV=neutral × dim.amount (затемнение region)

Также: убран unused cuda_grid_config_input (-Werror), добавлены fields
CUmodule + 4×CUfunction в Context; unload в uninit.

cuMemsetD2D* НЕ доступны в FFmpeg's CudaFunctions wrapper, поэтому solid
fill реализован через kernel — лишний overhead minimal (16×16 threads).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:24:01 +01:00
gx 9deaca7697 vf_cuda_grid: Phase 4b-1 — rect overlay primitives (solid fill, no alpha)
Добавляет inner overlay state с mutex + process_command handler.
Rendering filled/border rects через cuMemsetD2D8Async/D2D16Async — без
custom kernel'а (Phase 4b-2 = alpha blend, требует .cu).

Commands:
  add_overlay    <id> rect cell=N x=.. y=.. w=.. h=.. r=.. g=.. b=.. thickness=.. opacity=..
  remove_overlay <id>
  clear_overlays

text/icon/dim — типы определены, render заглушен до Phase 4b-2/3/4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 22:17:41 +01:00
gx 178fc5bb4e vf_cuda_grid: Phase 2b — delegated scaling to upstream scale_npp
После попытки in-filter NPP scaling обнаружено что nppiResize не имеет
_C2R variant для NV12 UV interleaved (только C1R, C3R, C4R). Alternatives:
- 2× nppiResize_8u_C1R с split/merge через intermediate buffers
- custom CUDA kernel
- treat UV pair как 16u (blending artifact на boundaries)

Pragmatic decision: cuda_grid делает только composition (same-size memcpy),
а scaling делегируется существующему scale_npp filter в filter chain:

  [0]scale_npp=1280:1080[s0]; [1]scale_npp=640:360[s1]; ... \
  [s0][s1]...cuda_grid=layout=main_plus_preview

Unix philosophy + leverages production-tested NPP code. Controller (Phase 3)
auto-generates filter graph с scale_npp per input.

Revert:
- #include <nppi.h>
- libnpp dependency в configure (cuda_grid_filter_deps="ffnvcodec")
- nppiResize* calls в compose path

Add:
- Error message с примером scale_npp chain pattern
- Doc в file header c filter graph пример

Phase 2 = full deliverable (2a + 2b). Дальше Phase 3 controller.
2026-05-19 21:45:40 +01:00
gx 11f310061a vf_cuda_grid: Phase 2b — NPP scaling для mixed-size inputs
- Add libnpp dependency в configure (cuda_grid_filter_deps)
- #include <nppi.h>, nppSetStream(s->cu_stream) перед resize batch
- Smart copy-or-scale в compose path:
  - src.size == cell.size → cuMemcpy2DAsync (fast path, zero overhead)
  - else → nppiResizeSqrPixel_8u_C1R для Y + _C2R для NV12 UV interleaved
- NPPI_INTER_LINEAR interpolation (bilinear — стандартный для video)
- Destination pointer offset через explicit pointer arithmetic
  (NPP не имеет dst_offset параметра, нужно сместить pSrc указатель)

Это unblock'ает mixed-size cameras (parking 1920x1080 + gate_lpr 2688x1520
в одном grid с main_plus_preview layout — big cell scaled до 1280x1080,
small cells scaled до 640x360).

Phase 2 complete (2a + 2b). Phase 3 будет controller sidecar.
2026-05-19 21:20:04 +01:00
gx df476472e2 vf_cuda_grid: fix include avstring.h для av_asprintf 2026-05-19 20:58:29 +01:00
gx 6ee2f474c7 vf_cuda_grid: Phase 2a — layout templates + dynamic nb_inputs
Layout templates (9):
- single, dual_horizontal, dual_vertical
- quad (default), main_plus_preview (1 big + 3 small)
- six_grid (3x2), nine_grid (3x3), sixteen_grid (4x4)
- panoramic

Cells определены в normalized координатах (0.0-1.0), переводятся в pixels
в config_output (× out_w/out_h). Alignment до chroma boundary (NV12 ÷ 2).

Filter options:
- layout=<name> (default quad)
- out_w=<int> (default 1920)
- out_h=<int> (default 1080)

Dynamic inputs:
- nb_inputs derived из layout (single=1, quad=4, nine_grid=9, sixteen_grid=16)
- ff_append_inpad_free_name в init() для каждой cell
- AVFILTER_FLAG_DYNAMIC_INPUTS на filter

Phase 2a limitation:
- Каждый input должен быть точно cell_px size (no scaling).
- Phase 2b добавит NPP resize для mixed-size inputs.
2026-05-19 20:57:08 +01:00
gx 4313c3f30d vf_cuda_grid: fix #include cuda_check.h + mixed decl warnings (-Werror) 2026-05-19 20:50:17 +01:00
gx 097ca81605 vf_cuda_grid: Phase 1 MVP — fixed quad layout, 4 CUDA inputs → 1 output
Phase 1 deliverable (см. gx/vf-cuda-grid#1):
- libavfilter/vf_cuda_grid.c (~270 LOC): multi-input filter, fixed 2×2 quad
- 4 NV12 CUDA frames same size → 2W × 2H output frame
- Composition: cuMemcpy2DAsync per Y + UV plane на каждый input
- framesync для lock-step pull всех 4 inputs
- Output hw_frames_ctx allocated from input device_ref
- Build wiring: CONFIG_CUDA_GRID_FILTER → libavfilter/{Makefile,allfilters.c}, configure deps на ffnvcodec

Limitations Phase 1:
- All inputs must be same size (no scaling)
- Quad layout hardcoded (no DSL, no runtime switching)
- NV12 only (no RGBA/YUV420P)

Phase 2: dynamic layouts + scaling. Phase 3: runtime control via process_command.
2026-05-19 20:47:00 +01:00
5 changed files with 1805 additions and 0 deletions
Vendored
+2
View File
@@ -3317,6 +3317,8 @@ thumbnail_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
transpose_npp_filter_deps="ffnvcodec libnpp"
overlay_cuda_filter_deps="ffnvcodec"
overlay_cuda_filter_deps_any="cuda_nvcc cuda_llvm"
cuda_grid_filter_deps="ffnvcodec"
cuda_grid_filter_deps_any="cuda_nvcc cuda_llvm"
sharpen_npp_filter_deps="ffnvcodec libnpp"
ddagrab_filter_deps="d3d11va IDXGIOutput1 DXGI_OUTDUPL_FRAME_INFO"
+2
View File
@@ -410,6 +410,8 @@ OBJS-$(CONFIG_OSCILLOSCOPE_FILTER) += vf_datascope.o
OBJS-$(CONFIG_OVERLAY_FILTER) += vf_overlay.o framesync.o
OBJS-$(CONFIG_OVERLAY_CUDA_FILTER) += vf_overlay_cuda.o framesync.o vf_overlay_cuda.ptx.o \
cuda/load_helper.o
OBJS-$(CONFIG_CUDA_GRID_FILTER) += vf_cuda_grid.o framesync.o \
vf_cuda_grid.ptx.o cuda/load_helper.o
OBJS-$(CONFIG_OVERLAY_OPENCL_FILTER) += vf_overlay_opencl.o opencl.o \
opencl/overlay.o framesync.o
OBJS-$(CONFIG_OVERLAY_QSV_FILTER) += vf_overlay_qsv.o framesync.o
+1
View File
@@ -390,6 +390,7 @@ extern const AVFilter ff_vf_overlay_qsv;
extern const AVFilter ff_vf_overlay_vaapi;
extern const AVFilter ff_vf_overlay_vulkan;
extern const AVFilter ff_vf_overlay_cuda;
extern const AVFilter ff_vf_cuda_grid;
extern const AVFilter ff_vf_owdenoise;
extern const AVFilter ff_vf_pad;
extern const AVFilter ff_vf_pad_opencl;
File diff suppressed because it is too large Load Diff
+121
View File
@@ -0,0 +1,121 @@
/*
* cuda_grid overlay CUDA kernels.
*
* Алфа-блендинг poверх NV12 frame:
* - Alpha_Fill_Y / Alpha_Fill_UV: solid colour fill region (rect/dim)
* - Alpha_Blit_RGBA_Y / Alpha_Blit_RGBA_UV: blit RGBA atlas → NV12
* (для text Phase 4b-3 и icon Phase 4b-4)
*
* BT.709 limited-range conversion (HDTV). См. также vf_cuda_grid.c rgb_to_yuv709.
*
* Лицензия: LGPL-2.1+ (соответствует FFmpeg)
*/
#include "cuda/vector_helpers.cuh"
extern "C" {
/* Solid colour α-blend на Y plane.
* dst: pointer на Y plane base
* dst_pitch: bytes per row
* rx, ry, rw, rh: region rect (pixels, must be in-bounds)
* fill: 0..255 (Y component to blend in)
* alpha: 0..255 (255 = fully opaque)
*/
__global__ void Alpha_Fill_Y(unsigned char *dst, int dst_pitch,
int rx, int ry, int rw, int rh,
int fill, int alpha)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= rw || y >= rh) return;
unsigned char *p = dst + (ry + y) * dst_pitch + (rx + x);
int cur = *p;
*p = (unsigned char)((fill * alpha + cur * (255 - alpha)) / 255);
}
/* Solid colour α-blend на UV plane (NV12 interleaved).
* rx, ry: in chroma plane coords (= full-res Y x/2, y/2)
* rw, rh: also в chroma coords
*/
__global__ void Alpha_Fill_UV(unsigned char *dst, int dst_pitch,
int rx, int ry, int rw, int rh,
int fill_u, int fill_v, int alpha)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= rw || y >= rh) return;
unsigned char *p = dst + (ry + y) * dst_pitch + (rx + x) * 2;
int cu = p[0], cv = p[1];
p[0] = (unsigned char)((fill_u * alpha + cu * (255 - alpha)) / 255);
p[1] = (unsigned char)((fill_v * alpha + cv * (255 - alpha)) / 255);
}
/* Blit RGBA atlas → Y plane с α-blending.
* dst, dst_pitch: Y plane
* dx, dy: destination pixel offset (full-res coords)
* atlas, atlas_pitch: RGBA source (interleaved R,G,B,A bytes), pitch в bytes
* w, h: atlas dimensions (in pixels)
* extra_alpha: дополнительный множитель (0..255) — overlay-level opacity
*/
__global__ void Alpha_Blit_RGBA_Y(unsigned char *dst, int dst_pitch,
int dx, int dy,
const unsigned char *atlas, int atlas_pitch,
int w, int h, int extra_alpha)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= w || y >= h) return;
const unsigned char *sp = atlas + y * atlas_pitch + x * 4;
int r = sp[0], g = sp[1], b = sp[2], a = sp[3];
a = a * extra_alpha / 255;
if (a == 0) return;
int Y = (int)(0.183f * r + 0.614f * g + 0.062f * b) + 16;
Y = Y < 0 ? 0 : (Y > 255 ? 255 : Y);
unsigned char *p = dst + (dy + y) * dst_pitch + (dx + x);
int cur = *p;
*p = (unsigned char)((Y * a + cur * (255 - a)) / 255);
}
/* Blit RGBA atlas → UV plane с 4:2:0 chroma subsampling.
* dst, dst_pitch: UV plane
* dx, dy: destination pixel offset (full-res coords; обе должны быть кратны 2)
* atlas, atlas_pitch: RGBA source
* w, h: atlas dimensions (full-res; UV operates на w/2 × h/2)
* extra_alpha: 0..255
*/
__global__ void Alpha_Blit_RGBA_UV(unsigned char *dst, int dst_pitch,
int dx, int dy,
const unsigned char *atlas, int atlas_pitch,
int w, int h, int extra_alpha)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int hw = w / 2, hh = h / 2;
if (x >= hw || y >= hh) return;
int sx = x * 2, sy = y * 2;
const unsigned char *row0 = atlas + sy * atlas_pitch;
const unsigned char *row1 = atlas + (sy + 1) * atlas_pitch;
int r = (row0[sx*4+0] + row0[(sx+1)*4+0] + row1[sx*4+0] + row1[(sx+1)*4+0]) >> 2;
int g = (row0[sx*4+1] + row0[(sx+1)*4+1] + row1[sx*4+1] + row1[(sx+1)*4+1]) >> 2;
int b = (row0[sx*4+2] + row0[(sx+1)*4+2] + row1[sx*4+2] + row1[(sx+1)*4+2]) >> 2;
int a = (row0[sx*4+3] + row0[(sx+1)*4+3] + row1[sx*4+3] + row1[(sx+1)*4+3]) >> 2;
a = a * extra_alpha / 255;
if (a == 0) return;
int U = (int)(-0.101f * r - 0.339f * g + 0.439f * b) + 128;
int V = (int)( 0.439f * r - 0.399f * g - 0.040f * b) + 128;
U = U < 0 ? 0 : (U > 255 ? 255 : U);
V = V < 0 ? 0 : (V > 255 ? 255 : V);
int du = dx / 2 + x;
int dv = dy / 2 + y;
unsigned char *p = dst + dv * dst_pitch + du * 2;
int cu = p[0], cv = p[1];
p[0] = (unsigned char)((U * a + cu * (255 - a)) / 255);
p[1] = (unsigned char)((V * a + cv * (255 - a)) / 255);
}
} /* extern "C" */