libcuframes: fix TOCTOU race in consumer slot read

Bug: the producer signals **one global** cudaEvent for the whole ring (one per producer). The consumer waits on this event after validating slot_seq, but the event corresponds to the LAST published frame, not to slot[target_seq]. If the producer wraps the ring during the event wait (ring=6 gives a 240 ms window), the slot already holds next-generation data and the consumer returns a torn/stale frame.

Symptom in production: the video stream periodically shows a momentary "back-jump": the camera OSD timestamp jitters and moving vehicles briefly teleport backwards. Cluster md5 analysis does NOT catch it (frame contents are still unique, just from the wrong epoch).

Fix: post-sync verify. After cudaStreamWaitEvent / cudaEventSynchronize, re-check slots[slot_idx].seq == target_seq. If the producer has overwritten the slot, continue the outer loop with a new target_seq. This closes the race window between slot validation and the return from event sync.

Still open:
- downstream GPU access after the frame fill (consumer-side): the producer may wrap during it. Mitigation: STRICT_WAIT policy in the publisher + ack discipline in the consumer (the cuframes_release_frame ack already works).
- a bigger ring size lowers the wrap frequency (240 ms → 1.2 s at ring=30).

Test: after deploying to cuda-grid-pipeline (Phase 7, single cam), the camera OSD clock no longer jitters (previously it jittered every ~16 sec).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
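
The 240 ms and 1.2 s figures in the message above are ring size divided by frame rate. A minimal sketch of that arithmetic, assuming the 25 fps camera rate from the deployment notes and a producer that publishes one frame per interval without stalling; `wrap_window_ms` is a hypothetical helper for illustration, not part of libcuframes:

```c
#include <assert.h>

/* Time (in milliseconds) until a just-published slot is overwritten:
 * the producer must publish ring_slots further frames, one per frame
 * interval of 1000/fps ms. Hypothetical helper, not a libcuframes API. */
static double wrap_window_ms(unsigned ring_slots, double fps)
{
    assert(ring_slots > 0 && fps > 0.0);
    return 1000.0 * (double)ring_slots / fps;
}
```

With these assumptions, `wrap_window_ms(6, 25.0)` gives the 240 ms window and `wrap_window_ms(30, 25.0)` gives 1200 ms. A larger ring only makes the race rarer; the post-sync verify is what removes it from the read path.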
roadmap: vf_cuda_grid split out into a separate product gx/vf-cuda-grid
2026-05-21 22:27:39 +01:00 · 2026-05-19 20:39:47 +01:00 · 2026-05-19 19:22:53 +01:00 · 2026-05-19 19:07:16 +01:00
3 changed files with 113 additions and 18 deletions
@@ -117,3 +117,95 @@ cd build && cmake -DBUILD_TESTING=ON ..  && cmake --build . && ctest -R stress -
 For production deployment measurements, see the integration guides:
 - [docs/integration.md](docs/integration.md) — cctv-processor C++ pipeline
 - [filter/README.md](filter/README.md) — FFmpeg demuxer (Frigate setup)
+
+---
+
+## Real-world production deployment (2026-05-19, v0.2.0)
+
+**Setup**: 4 Dahua IP cameras (HEVC main 1920×1080 / 2688×1520, 25 fps) → 3
+concurrent consumers on a single RTX 5090 host:
+- **Frigate** detect (ONNX D-FINE-S, 640×480) + record (full-res H.265 mp4)
+- **cctv-backend** custom C++ mosaic processor (composes a 4×grid → RTSP output for TV)
+
+### Before → after (measured in production, identical workload)
+
+| Metric | Without cuframes | With cuframes v0.2 dual-input | Reduction |
+|---|---:|---:|---:|
+| **RTSP connections to cameras** | 12 (4 cam × 3 consumer) | **4** (publishers only) | **−67%** |
+| **NVDEC sessions** | ~8 (decode per consumer) | **4** (publishers only) | **−50%** |
+| **Camera-side bandwidth** | ~34 Mbps (main+main+sub per cam) | **~16 Mbps** (main per cam) | **−54%** |
+| **PCIe D2H copies (consumer side)** | ~346 MB/s (decoded frames → host) | **~0** (zero-copy CUDA IPC) | **−100%** |
+| **Frigate ffmpeg with direct RTSP** | 8 (detect+record × 4) | **0** (all via cuframes) | **−100%** |
+
+### Live nvidia-smi metrics on the running system
+
+```
+GPU SM:     4-5%   (compute: detector + cuframes consumers)
+GPU NVDEC:  2-4%   (without cuframes, 15-25% would be expected)
+GPU NVENC:  0-1%
+```
+
+### VRAM breakdown (measured)
+
+| Component | VRAM |
+|---|---:|
+| 4× cuframes publishers (3× FHD ring + 1× 2688×1520 for LPR) | **4.4 GB** |
+| cctv-backend (composer + grid output) | 1.0 GB |
+| frigate.embeddings_manager (face + LPR ONNX models) | 1.6 GB |
+| frigate.detector:onnx (D-FINE-S COCO) | 0.6 GB |
+| **Total cuframes-stack VRAM** | **~7.7 GB** |
+
+Of this, cuframes itself accounts for only **4.4 GB**, in the publishers (ring buffers +
+NVDEC decode buffers). The consumers (Frigate, cctv-backend) hold their own CUDA
+contexts independently.
+
+### Network bandwidth (real tcpdump, 10-sec sample)
+
+**31.5 Mbps** from the camera subnet (4 cameras → R9), measured with
+`tcpdump -w cam-traffic.pcap` over 10 seconds.
+
+Approximate breakdown:
+- 4 publishers × main HEVC RTP/UDP: **~16 Mbps** (cuframes core)
+- go2rtc on-demand streams (Frigate UI live preview, if open): **0-10 Mbps**
+- ONVIF discovery, RTSP keepalives, NTP-from-cameras: **~1-2 Mbps**
+
+Without cuframes, the same setup (cctv-backend + Frigate detect + Frigate record × 4
+cameras) would have used **~45-50 Mbps** (mainly because the record path pulled a
+separate main stream from each camera).
+
+### Camera-side benefits
+
+Dahua/Hikvision cameras are typically capped at 4-5 concurrent RTSP streams.
+Before cuframes, the setup (4 cam × 3 RTSP) put each camera at **60-75% capacity**
+of its RTSP server. After: **20-25%**, leaving headroom for 2-3 additional
+consumers without replacing hardware.
+
+### What is **preserved** (important)
+
+- **Recording quality**: the record path via `cuframes_packets://` is a **passthrough**
+  (`-c:v copy`), the bit-exact original encoded stream from the camera. Frigate writes
+  mp4 at the original full resolution, without re-encoding.
+- **Latency**: <2 ms publisher → consumer (cuframes IPC) vs ~50-80 ms RTSP setup
+  latency for each new consumer.
+- **Backward compatibility**: v0.2 publishers accept v1 subscribers
+  (frames-only), allowing a rolling upgrade.
+
+### Hardware-agnostic projection (for other setups)
+
+| If you have | Expected reduction |
+|---|---|
+| 16 cameras × 2 consumers | 32 → 16 NVDEC (−50%), 32 → 16 RTSP (−50%) |
+| 8 cameras × 3 consumers | 24 → 8 NVDEC (−67%), 24 → 8 RTSP (−67%) |
+| 4 cameras × 4 consumers (multi-AI pipeline) | 16 → 4 NVDEC (−75%), 16 → 4 RTSP (−75%) |
+
+The number of sessions saved scales **linearly** with N (consumers per camera); the
+relative reduction is 1 − 1/N. v0.1 (frames only) saves NVDEC sessions; v0.2
+(frames + packets) **additionally** saves RTSP connections for record/mux consumers.
+
+### What is **NOT** saved (honestly)
+
+- **Disk space**: recordings remain full-resolution H.265 mp4. Cuframes does not compress.
+- **Detector inference latency**: the ONNX/TensorRT detector runs on decoded
+  frames regardless of source. Cuframes only changes where the decode happens.
+- **Camera RTSP server CPU**: the camera still encodes the video itself. Cuframes
+  reduces **consumer-side** load, not producer-side.
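
The projection table in the section above follows from simple counting. A minimal sketch, assuming every consumer would otherwise open its own RTSP connection and NVDEC session per camera, and that with sharing only one publisher per camera does so; the function names are hypothetical and not part of cuframes:

```c
#include <assert.h>

/* Sessions when each consumer connects to each camera directly. */
static unsigned sessions_without_sharing(unsigned cameras, unsigned consumers)
{
    return cameras * consumers;
}

/* Sessions when one publisher per camera feeds all consumers. */
static unsigned sessions_with_sharing(unsigned cameras)
{
    return cameras;
}

/* Relative reduction in percent: 1 - 1/consumers, independent of the
 * number of cameras. */
static double reduction_percent(unsigned consumers)
{
    assert(consumers > 0);
    return 100.0 * (1.0 - 1.0 / (double)consumers);
}
```

Under this model, 8 cameras × 3 consumers gives 24 sessions without sharing and 8 with it. Real deployments can deviate: in the measured setup above, NVDEC went from ~8 to 4 rather than from 12 to 4, because not every consumer decoded every stream.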
@@ -75,27 +75,19 @@ ETA: 1-2 weeks of focused work.

 Open questions: which memory type to use, `memory:CUDAMemory` (mainline) vs `memory:NVMM` (NVIDIA DeepStream-specific). Possibly two variants/build flags.

-### `vf_cuda_grid`: FFmpeg filter with runtime grid composition
+### `vf_cuda_grid`: **split out into a separate product `gx/vf-cuda-grid`** ([repo](https://git.goldix.org/gx/vf-cuda-grid))

-CCTV mosaic composition as an FFmpeg filter, **entirely on the GPU**. Replaces the custom C++ GridComposer (see [gx/cctv#22](https://git.goldix.org/gx/cctv/issues/22), the performance investigation of cctv-processor: CPU round-trip pipeline).
+FFmpeg filter for GPU-native video grid composition + control-plane sidecar
+(ZeroMQ/MQTT/HTTP/HA Discovery). The design is finalized, see
+[`gx/vf-cuda-grid` docs/design.md](https://git.goldix.org/gx/vf-cuda-grid/src/branch/main/docs/design.md)
+and [epic issue #1](https://git.goldix.org/gx/vf-cuda-grid/issues/1).

-| Capability | Why |
-|---|---|
-| Filter accepts N cuda-frames (via `[in0][in1][in2]...` filter inputs) | Composition in a single filter graph without custom code |
-| Output is a single cuda-frame with N cells in the layout | Direct input to `hwdownload` or `h264_nvenc` |
-| Layout templates (`single`, `quad`, `main_plus_preview`, `nine_grid`, ...) | Configurable from the CLI or via a filter command |
-| `sendcmd` / API for changing the layout at runtime | No need to tear down the filter graph to switch grids |
-| Per-cell overlays (text, bbox) via side data in AVFrame | Frigate detection/LPR/face: overlay inside the pipeline |
-| Fully CUDA-side: scale/composition/text rendering | Zero CPU round-trip, the frame never leaves VRAM |
+Cuframes remains the frame source provider for vf-cuda-grid in our ecosystem
+(but vf-cuda-grid also works with any other CUDA frame source, i.e. standard FFmpeg).

-This turns cuframes from an IPC library into a full-fledged **GPU-native video routing platform**. In spirit it is close to NVIDIA DeepStream `nvstreammux` + `nvmultistreamtiler`, but open-source and built on the conventional FFmpeg stack.
-
-Open questions:
- Filter input mode: pull-based (the filter pulls N inputs) or push-based (via external lock-step). The FFmpeg filter API is more pull-friendly.
- Text rendering in CUDA: `vf_drawtext` has a CPU path; this needs either a GPU font renderer (Pango/freetype + texture upload) or CPU-precomputed glyph atlases.
- Runtime layout commands via the filter `process_command` API.
-
-This is a **large scope**: a separate major version (v0.5+) or a standalone project.
+Closes [`gx/cctv#22`](https://git.goldix.org/gx/cctv/issues/22) Phase 4
+(end-to-end GPU pipeline for the cctv-processor mosaic composer) after Phase 4 of vf-cuda-grid +
+migration of cctv-processor GridComposer → vf_cuda_grid filter.

 ## v1.0 — Stable ABI 📋

@@ -290,6 +290,17 @@ int cuframes_subscriber_next(cuframes_subscriber_t *sub,
                if (cerr != cudaSuccess) return CUFRAMES_ERR_CUDA;
            }

+            /* TOCTOU protection: producer_event signals for the last published
+             * frame, not per slot. If the producer wrapped the ring while we were
+             * waiting on the event sync, slot[slot_idx] already holds a DIFFERENT seq.
+             * Re-verify slot_seq; if it changed, retry with a new target_seq. */
+            uint64_t verify_seq = atomic_load_explicit(&sub->hdr->slots[slot_idx].seq,
+                                                       memory_order_acquire);
+            if (verify_seq != target_seq) {
+                /* Slot overwritten during the event wait; the outer loop recomputes */
+                continue;
+            }
+
            /* Fill frame_out */
            struct cuframes_frame *f = &sub->frame_obj;
            f->cuda_ptr = sub->mapped_ptrs[slot_idx];
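
The validate, wait, re-validate pattern from the hunk above can be exercised without a GPU. The following is a minimal CPU-only sketch, assuming a slot index of `target_seq % ring size` and replacing the CUDA event wait with a callback; all `demo_*` names are hypothetical and none of this is the libcuframes implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RING_SLOTS 6

struct demo_slot { _Atomic uint64_t seq; };
struct demo_ring { struct demo_slot slots[DEMO_RING_SLOTS]; };

enum demo_result { DEMO_OK = 0, DEMO_RETRY = 1 };

/* Stand-in for the event wait (cudaEventSynchronize in the real code). */
typedef void (*demo_wait_fn)(struct demo_ring *ring);

/* Slot i starts out holding frame seq i. */
static void demo_ring_init(struct demo_ring *ring)
{
    for (unsigned i = 0; i < DEMO_RING_SLOTS; i++)
        atomic_init(&ring->slots[i].seq, i);
}

static enum demo_result demo_read_slot(struct demo_ring *ring,
                                       uint64_t target_seq,
                                       demo_wait_fn wait_for_event)
{
    assert(ring != NULL && wait_for_event != NULL);
    /* Assumed mapping from sequence number to slot index. */
    unsigned slot_idx = (unsigned)(target_seq % DEMO_RING_SLOTS);

    /* 1. Pre-wait validation: the slot must hold the frame we want. */
    if (atomic_load_explicit(&ring->slots[slot_idx].seq,
                             memory_order_acquire) != target_seq)
        return DEMO_RETRY;

    /* 2. Wait on the global event; the producer may wrap meanwhile. */
    wait_for_event(ring);

    /* 3. Post-sync verify: reject the slot if it was overwritten. */
    if (atomic_load_explicit(&ring->slots[slot_idx].seq,
                             memory_order_acquire) != target_seq)
        return DEMO_RETRY;

    return DEMO_OK;
}

/* Wait during which the producer publishes nothing. */
static void demo_wait_quiet(struct demo_ring *ring) { (void)ring; }

/* Wait during which the producer wraps the whole ring once. */
static void demo_wait_with_wrap(struct demo_ring *ring)
{
    for (unsigned i = 0; i < DEMO_RING_SLOTS; i++)
        atomic_fetch_add_explicit(&ring->slots[i].seq, DEMO_RING_SLOTS,
                                  memory_order_release);
}
```

In this sketch a caller that receives `DEMO_RETRY` would recompute `target_seq` and try again, mirroring the `continue` in the outer loop. As the commit message notes, this covers only the window up to the return from the wait; use of the frame afterwards still relies on the publisher's wait policy and consumer acks.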
Author	SHA1	Message	Date
gx	517107d741	libcuframes: fix TOCTOU race in consumer slot read	2026-05-21 22:27:39 +01:00
gx	4d54173bb2	roadmap: vf_cuda_grid split out into a separate product gx/vf-cuda-grid	2026-05-19 20:39:47 +01:00
gx	52fb2ad722	benchmarks: actual measured VRAM + network bandwidth (tcpdump-based) VRAM breakdown (nvidia-smi pmon): - 4 publishers = 4.4 GB (FHD + 2688x1520 ring buffers + NVDEC) - cctv-backend = 1.0 GB - frigate embeddings_manager = 1.6 GB - frigate detector:onnx = 0.6 GB - Total cuframes-stack = ~7.7 GB Network (10-sec tcpdump capture from the camera subnet to R9): - Measured: 31.5 Mbps (everything, including go2rtc on-demand and ONVIF) - cuframes core: ~16 Mbps (4 publishers × main HEVC) - ONVIF/RTSP keepalives: ~1-2 Mbps - Without cuframes, the same 4 cam × 3 consumer setup would be ~45-50 Mbps Source: production deploy 2026-05-19 measurement.	2026-05-19 19:22:53 +01:00
gx	3779175737	docs(benchmarks): production v0.2 deploy metrics (4 cam × 3 consumer) Real-world numbers from the production deploy of 2026-05-19: - RTSP to cameras: 12 → 4 (−67%) - NVDEC sessions: 8 → 4 (−50%) - Camera bandwidth: 34 → 16 Mbps (−54%) - PCIe D2H copies: 346 MB/s → ~0 (−100% via zero-copy CUDA IPC) - Frigate direct RTSP: 8 → 0 (−100%) Plus live nvidia-smi metrics, what is preserved vs not saved, and a projection table for other setups (8/16 cam × 2/3/4 consumer). For promotional material: public-facing claims based on a measured deploy.	2026-05-19 19:07:16 +01:00