Architectural decision: production GPU efficiency, 3 options #2

2026-05-21T07:20:42+01:00

gx commented

2026-05-21 07:20:42 +01:00

Context

The current runtime layout switching architecture (3 cuda_grid instances + streamselect + a 3-way split per camera) works on the RTX 5090 (dev) at 25-33% GPU utilization. On production hardware (RTX 4060 / A2000 / T4, roughly 5× lower throughput) the same pipeline saturates at ~100%, so it is not workable.

Problem: streamselect picks one output, but every upstream branch keeps producing frames continuously so that it is ready for a switch. That means 3 cuda_grid composes + 12 scale_npp operations per frame, even though only one layout is visible.

We need a redesign: do not generate what is not visible.
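A back-of-the-envelope check of the numbers above, as a minimal sketch. It assumes that GPU utilization scales linearly with the throughput ratio and that the three layout branches account for all of the measured load in equal shares (decode and NVENC are ignored). Both are simplifying assumptions, not measurements.

```python
# Rough load model. ASSUMPTIONS: utilization scales linearly with throughput,
# and the layout branches share the measured load equally.
DEV_UTIL_RANGE = (0.25, 0.33)  # measured on the RTX 5090 (from this issue)
THROUGHPUT_RATIO = 5           # production is ~5x slower (from this issue)
LAYOUT_BRANCHES = 3            # quad + single + mpp, all rendered every frame


def projected_util(dev_util, slowdown, branches_rendered,
                   branches_total=LAYOUT_BRANCHES):
    """Scale dev utilization to slower hardware and to fewer rendered branches."""
    return dev_util * slowdown * branches_rendered / branches_total


for label, rendered in (("current (3 branches)", 3),
                        ("visible only (1 branch)", 1)):
    low, high = (projected_util(u, THROUGHPUT_RATIO, rendered)
                 for u in DEV_UTIL_RANGE)
    print(f"{label}: {low:.0%}-{high:.0%} of a production GPU")
```

Under these assumptions the current design needs more than a whole production GPU, while rendering only the visible branch stays well below saturation.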

Measurements of the current architecture

Metric	Value
GPU utilization (5090)	25-33%
Pipeline container CPU	20.5%
Pipeline VRAM (our ffmpeg portion)	~300 MB
cuda_grid composes/frame	3 (quad + single + mpp)
scale_npp ops/frame	12 (4 per layout)
Layout switch latency	instant (ZMQ command)
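For reference, the instant switch relies on a command sent over ZMQ. FFmpeg's zmq filter accepts messages of the form `TARGET COMMAND [ARG]`, and streamselect exposes a `map` command. The sketch below only builds such a message; the instance name `streamselect@layout` is a hypothetical placeholder, and the actual transport (for example a pyzmq REQ socket connected to the filter's bind address) is assumed and not shown.

```python
def streamselect_map_message(target: str, input_index: int) -> str:
    """Build a 'TARGET COMMAND ARG' message that asks streamselect to
    route the given input index to its output."""
    if input_index < 0:
        raise ValueError("input index must be non-negative")
    return f"{target} map {input_index}"


# Hypothetical instance name; the real one depends on the filter graph.
print(streamselect_map_message("streamselect@layout", 2))
```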

Redesign options

Option 1: Single fixed layout + smart cell content

A single cuda_grid (for example, always main_plus_preview). Cell content is dynamic, selected by a per-cell streamselect on each cell input.

4 cuframes → 4 scale_npp → 4 streamselect (per cell) → cuda_grid mpp → NVENC

The auto-selector assigns: main = highest-priority active camera, preview cells = the other active cameras (no duplicates).
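The assignment rule can be sketched as follows. This is a minimal illustration, not the controller's actual code: the `Camera` fields, the function name, and the choice to fill leftover preview cells with idle cameras are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Camera:
    name: str
    priority: int  # higher = more important (hypothetical field)
    active: bool   # e.g. motion currently detected (hypothetical field)


def assign_cells(cameras, preview_cells=3):
    """Return camera names as [main, preview, preview, ...].

    Active cameras come first, ordered by priority, so the main cell gets
    the highest-priority active camera. Each camera appears at most once.
    ASSUMPTION: leftover cells are filled with idle cameras by priority.
    """
    active = sorted((c for c in cameras if c.active), key=lambda c: -c.priority)
    idle = sorted((c for c in cameras if not c.active), key=lambda c: -c.priority)
    return [c.name for c in (active + idle)[: 1 + preview_cells]]


cams = [
    Camera("door", 1, False),
    Camera("yard", 3, True),
    Camera("garage", 2, True),
    Camera("hall", 5, False),
]
print(assign_cells(cams))
```

Each position in the returned list would then translate into one per-cell streamselect remap.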


GPU compute	1× baseline (vs 3× current)
VRAM	1× (one frame pool)
Layout variety	Limited (mpp only)
Switch latency	Instant (cell remap)
Dev cost	Low (controller only)
Production fit (RTX 4060)	✅

Trade-off: no single/quad layouts. UX: always a 4-cell view, with emphasis conveyed through the main cell.

Option 2: Pipeline restart on layout switch

A single cuda_grid with the selected layout. On a switch, the controller calls docker restart cuda-grid-pipeline with the new layout option.


GPU compute	1× baseline
VRAM	1×
Layout variety	Full (quad/single/mpp/six_grid/etc.)
Switch latency	~2-3 sec (pipeline init + RTSP reconnect)
Dev cost	Low
Production fit	✅ but UX suffers

Trade-off: with auto-layout, the pipeline restarts on every motion change, which means a TV freeze + reconnect each time. Not suitable for motion-driven auto switching. OK for manual, user-initiated switches.

Option 3: Filter rework (proper long-term)

vf_cuda_grid gains runtime layout change + internal per-cell scaling.

Filter changes:

- MAX_CELLS=16 pads, nb_active_cells dynamic per layout
- process_command "layout <name>" → changes cell_px[] + active count
- process_command "cell_map <i> <pad>" → swaps the input pad for cell i
- Internal NV12 scaling per cell (NPP on the Y plane + UV separately, or a custom CUDA kernel)
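The two commands can be modelled as a small state machine. The sketch below is a Python model of the intended behaviour, not the filter itself (which would be C inside FFmpeg). The layout table, the cell rectangles for a 1920x1080 output, and the class name are hypothetical.

```python
MAX_CELLS = 16

# HYPOTHETICAL layout table: name -> (x, y, w, h) rectangles, 1920x1080 output.
LAYOUTS = {
    "single": [(0, 0, 1920, 1080)],
    "quad": [(0, 0, 960, 540), (960, 0, 960, 540),
             (0, 540, 960, 540), (960, 540, 960, 540)],
    "mpp": [(0, 0, 1440, 1080), (1440, 0, 480, 360),
            (1440, 360, 480, 360), (1440, 720, 480, 360)],
}


class GridState:
    def __init__(self, layout="quad"):
        # cell i reads from input pad cell_map[i]; identity mapping at start
        self.cell_map = list(range(MAX_CELLS))
        self.set_layout(layout)

    def set_layout(self, name):
        if name not in LAYOUTS:
            raise ValueError(f"unknown layout: {name}")
        self.cell_px = LAYOUTS[name]
        self.nb_active_cells = len(self.cell_px)

    def process_command(self, cmd, arg):
        if cmd == "layout":
            self.set_layout(arg.strip())
        elif cmd == "cell_map":
            cell, pad = (int(tok) for tok in arg.split())
            if not (0 <= cell < self.nb_active_cells and 0 <= pad < MAX_CELLS):
                raise ValueError("cell or pad out of range")
            self.cell_map[cell] = pad
        else:
            raise ValueError(f"unknown command: {cmd}")


state = GridState()
state.process_command("layout", "mpp")
state.process_command("cell_map", "0 3")  # show input pad 3 in the main cell
print(state.nb_active_cells, state.cell_map[:4])
```

Keeping all 16 pads connected while only `nb_active_cells` of them are composed is what lets a layout change happen without rebuilding the graph.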


GPU compute	1× baseline
VRAM	1×
Layout variety	Full + extensible
Switch latency	Instant
Dev cost	High (~2-3 days filter dev: custom kernel or NPP UV trick)
Production fit	✅✅ best long-term
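Why the per-cell scaling needs separate Y and UV handling: an NV12 frame is a full-resolution single-channel Y plane followed by a half-resolution plane of interleaved U/V pairs. The sketch below lists the two resize jobs a cell would need. The function and field names are illustrative, and the even-alignment check is an assumption about how cells would be constrained.

```python
def cell_scale_jobs(src_w, src_h, cell):
    """Describe the two resizes needed to scale one NV12 input into a cell.

    cell is (x, y, w, h) in output pixels. With 4:2:0 subsampling the UV
    plane has half the width and height of Y, so all values must be even.
    """
    x, y, w, h = cell
    if any(v % 2 for v in (src_w, src_h, x, y, w, h)):
        raise ValueError("NV12 sizes and offsets must be even")
    return [
        # Y plane: one 8-bit channel at full resolution
        {"plane": "Y", "channels": 1,
         "src": (src_w, src_h), "dst": (w, h), "dst_offset": (x, y)},
        # UV plane: two interleaved 8-bit channels at half resolution
        {"plane": "UV", "channels": 2,
         "src": (src_w // 2, src_h // 2), "dst": (w // 2, h // 2),
         "dst_offset": (x // 2, y // 2)},
    ]


for job in cell_scale_jobs(1920, 1080, (1440, 360, 480, 360)):
    print(job)
```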

Comparison

Aspect	Current	Option 1	Option 2	Option 3
GPU load	3× ❌	1× ✅	1× ✅	1× ✅
Layouts	full	1 only	full	full
Switch latency	instant	instant	~2-3s ❌	instant
Auto motion-driven	✅	✅	❌	✅
Dev work	done	controller only	controller only	filter rework
Production-ready	❌	✅	partial	✅

Recommendation

Short to medium term: Option 1 (smart cell content in a fixed mpp layout). This gives motion + priority highlighting on production hardware without filter development.

Long term: Option 3, a proper filter rework, once more layouts are needed.

Make the decision once the production hardware spec and feature requirements are clear.

Related

- The 3 s hysteresis debounce already works (controller commit 4cd2b4b)
- The current code uses the 3-layout streamselect architecture (dev only)
- localhost-infra commit 40b6974 = rollback of the complex split=4 mpp_main attempt

gx referenced this issue

2026-05-21 09:54:28 +01:00

Resilience: loss of any input must not break the output #3

gx referenced this issue from a commit

2026-05-25 15:57:24 +01:00

pipeline_monitor: 2 bug fixes — stall detection + lost MQTT event


Reference: gx/vf-cuda-grid#2