# Benchmarks

All measurements were taken on the reference hardware described below. The numbers are repeatable and can be reproduced independently with `libcuframes/tests/test_stress.cu` and [`tools/cuframes-rtsp-source`](tools/cuframes-rtsp-source) + `examples/sub_count`.

## Reference hardware

| Component | Value |
|---|---|
| GPU | NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB VRAM, 3× NVDEC, 1× NVENC) |
| CPU | AMD Ryzen 9 7950X (16 cores, 32 threads) |
| RAM | 64 GB DDR5-6000 |
| OS | Ubuntu 24.04 (kernel 6.17, glibc 2.39) |
| CUDA | Driver 555+, Toolkit 12.x / 13.x |
| PCIe | Gen5 ×16 (GPU connection) |

## Stress test: 1× publisher × 4× consumer × 2000 frames

Run: `libcuframes/tests/test_stress.cu`, a fork-based test with 1 publisher + 4 consumers (2 fast + 2 slow with a 5 ms sleep) on 1280×720 NV12 frames at a ~120 fps target.

| Metric | Value |
|---|---|
| Frames per consumer | 2000 / 2000 |
| Gaps (lost seq) | **0 on all 4 consumers** |
| Torn frames (verified by the `verify_y` kernel) | **0 on all 4 consumers** |
| Wall time | 18.8 s |
| Effective publisher rate | ~106 fps (below the 120 fps target because of the slow consumers) |

## E2E real camera: 1 publisher + 1 consumer

Camera: Dahua HFW3441, main stream 1920×1080 HEVC 25 fps, sub-stream 640×480 25 fps. Publisher: `cuframes-rtsp-source` (host build). Consumer: `examples/sub_count` or `test_cuframes_source` (cctv-processor's `CuframesSource`).

| Metric | NV12 1920×1080 | NV12 640×480 |
|---|---|---|
| Frame size (packed) | 3,110,400 bytes (~3 MB) | 460,800 bytes |
| Effective bandwidth | 75 MB/s | 11 MB/s |
| Publisher decode rate | 25.03 fps (matches camera) | 25.00 fps |
| Consumer receive rate | 25.03 fps | 25.34 fps |
| 100-frame test | 0 drops, 0 gaps | 0 drops, 0 gaps |

## Production: 1× publisher → N× consumers (Frigate + cctv-backend)

A real production setup (24+ hours of uptime):

- Publisher: `cuframes-pub-parking`, Dahua 192.168.88.98 sub-stream 640×480 HEVC 25 fps
- Consumer 1: **Frigate 0.17.1** via the FFmpeg `cuframes://` demuxer (detect path; ONNX object detection)
- Consumer 2: **cctv-backend** via the C++ `CuframesSource` (motion detect + grid composer + RTSP encode → TV)

| Metric | Value |
|---|---|
| Total NVDEC operations | **1** (publisher only) |
| Without cuframes it would be | **2** (Frigate detect + cctv-backend detect) |
| GPU encoder | 1× (cctv-backend H.264 encode for the RTSP output) |
| Publisher VRAM ring | 6 buffers × 460 KB ≈ **2.8 MB** (sub-stream) |
| Frigate detect drops | 0 over 24h |
| cctv-backend frame loss | 0 over 24h |

## VRAM cost: NV12 ring buffer

Ring size = `frame_size × ring_size`. NV12 frame size = `width × height × 1.5`.

| Resolution | Frame size (approx.) | Ring of 6 buffers (approx.) |
|---|---|---|
| 640×480 | 460 KB | 2.8 MB |
| 1280×720 | 1.38 MB | 8.3 MB |
| 1920×1080 (FHD) | 3 MB | 18 MB |
| 2560×1440 | 5.4 MB | 33 MB |
| 2688×1520 (Dahua 4MP) | 6 MB | 36 MB |
| 3840×2160 (4K) | 12 MB | 72 MB |

For a 16-camera setup on an RTX 5090 (32 GB VRAM), with every camera at FHD and ring=6, the rings total ~288 MB (16 × 18 MB). **That is <1% of the available VRAM.**

## Comparison: cuframes vs traditional N×NVDEC

Scenario: 16 cameras × 25 fps × 3 consumers (Frigate, cctv-processor, AI pipeline).

| Approach | NVDEC ops/sec | VRAM bandwidth (decoded path) |
|---|---|---|
| Without cuframes | 16 × 25 × 3 = **1200** | ≥ 1200 × 6 MB = 7.2 GB/s |
| With cuframes (v0.1) | 16 × 25 × 1 = **400** | ≥ 16 × 25 × 6 MB = 2.4 GB/s |
| **Savings** | **3× fewer NVDEC ops** | **3× less memory bandwidth** |

NVDEC throughput limit on the RTX 5090: ~50 concurrent FHD25 streams. Without cuframes, 3 consumers × 16 cameras = 48 streams, which takes ~96% of capacity (saturation). With cuframes, the 16 streams take ~32%, leaving headroom for scaling.

## Latency

| Hop | Latency |
|---|---|
| RTSP → publisher demuxer | sub-frame (<40 ms at FHD25) |
| NVDEC decode | ~3-5 ms per frame |
| publish_external → consumer receive | **<0.5 ms** (cudaEventRecord → cudaStreamWaitEvent) |
| consumer cudaMemcpy NV12 → host (FFmpeg demuxer v1) | ~2-3 ms at FHD |
| **End-to-end RTSP → consumer frame ready** | ~50-100 ms typical |

The zero-copy path (via `AVHWFramesContext`, planned for v0.2) will remove the CPU copy: `<10 ms` end-to-end in the ideal case.

## Reproducibility

All benchmarks are reproducible from the repo:

```bash
# Stress test
cd build && cmake -DBUILD_TESTING=ON .. && cmake --build . && ctest -R stress -V

# E2E single consumer
./tools/cuframes-rtsp-source --rtsp rtsp://... --key cam1 --ring 6 --verbose &
./examples/sub_count --key cam1 --max-frames 100 --verbose
```

For production deployment measurements, see the integration guides:

- [docs/integration.md](docs/integration.md): cctv-processor C++ pipeline
- [filter/README.md](filter/README.md): FFmpeg demuxer (Frigate setup)

---

## Real-world production deployment (2026-05-19, v0.2.0)

**Setup**: 4 Dahua IP cameras (HEVC main stream, 1920×1080 / 2688×1520, 25 fps) → 3 concurrent consumers on a single RTX 5090 host:

- **Frigate** detect (ONNX D-FINE-S, 640×480)
- **Frigate** record (full-res H.265 mp4)
- **cctv-backend** custom C++ mosaic processor (composes a 4× grid → RTSP output for the TV)

### Before → after (measured in production, identical workload)

| Metric | Without cuframes | With cuframes v0.2 dual-input | Reduction |
|---|---:|---:|---:|
| **RTSP connections to cameras** | 12 (4 cam × 3 consumers) | **4** (publishers only) | **−67%** |
| **NVDEC sessions** | ~8 (decode in each consumer) | **4** (publishers only) | **−50%** |
| **Camera-side bandwidth** | ~34 Mbps (main+main+sub per cam) | **~16 Mbps** (main per cam) | **−54%** |
| **PCIe D2H copies (consumer side)** | ~346 MB/s (decoded frames → host) | **~0** (zero-copy CUDA IPC) | **−100%** |
| **Frigate ffmpeg processes with direct RTSP** | 8 (detect+record × 4) | **0** (all via cuframes) | **−100%** |

### Live nvidia-smi metrics on the running system

```
GPU SM:     4-5%   (compute: detector + cuframes consumers)
GPU NVDEC:  2-4%   (expected 15-25% without cuframes)
GPU NVENC:  0-1%
```

### VRAM breakdown (measured)

| Component | VRAM |
|---|---:|
| 4× cuframes publishers (3× FHD ring + 1× 2688×1520 for LPR) | **4.4 GB** |
| cctv-backend (composer + grid output) | 1.0 GB |
| frigate.embeddings_manager (face + LPR ONNX models) | 1.6 GB |
| frigate.detector:onnx (D-FINE-S COCO) | 0.6 GB |
| **Total cuframes-stack VRAM** | **~7.7 GB** |

Of that total, only the **4.4 GB** in the publishers (ring buffers + NVDEC decode buffers) is attributable to cuframes itself. The consumers (Frigate, cctv-backend) hold their own CUDA contexts independently.
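
As a rough cross-check of the ring-buffer share, the NV12 formula from the "VRAM cost" section (`width × height × 1.5`, times the ring depth) can be evaluated directly. This is only a minimal sketch: the function names are hypothetical and not part of cuframes, and `ring_size=6` is an assumption taken from the VRAM cost table, since the ring depth used in this deployment is not stated.

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """NV12 = full-resolution Y plane + half-resolution interleaved UV plane,
    i.e. width * height * 1.5 bytes (exact for even dimensions)."""
    return width * height * 3 // 2


def ring_bytes(width: int, height: int, ring_size: int = 6) -> int:
    """Ring size = frame_size * ring_size. ring_size=6 is an assumed default."""
    return nv12_frame_bytes(width, height) * ring_size


if __name__ == "__main__":
    # Resolutions taken from the tables in this document.
    for w, h in [(640, 480), (1920, 1080), (2688, 1520)]:
        frame = nv12_frame_bytes(w, h)
        ring = ring_bytes(w, h)
        print(f"{w}x{h}: frame {frame / 1e6:.2f} MB, ring of 6 {ring / 1e6:.2f} MB")
```

Under these assumptions the rings themselves come to tens of megabytes per camera. The measured 4.4 GB also includes NVDEC decode buffers, which this sketch does not model.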

### Network bandwidth (real tcpdump, 10-second sample)

**31.5 Mbps** from the camera subnet (4 cameras → R9), measured with `tcpdump -w cam-traffic.pcap` over 10 seconds. Approximate breakdown:

- 4 publishers × main HEVC RTP/UDP: **~16 Mbps** (cuframes core)
- go2rtc on-demand streams (Frigate UI live preview, if open): **0-10 Mbps**
- ONVIF discovery, RTSP keepalives, NTP from the cameras: **~1-2 Mbps**

Without cuframes, the same setup (cctv-backend + Frigate detect + Frigate record × 4 cameras) would produce **~45-50 Mbps**, mainly because the record path pulled a separate main stream from each camera.

### Camera-side benefits

Dahua/Hikvision cameras are typically capped at 4-5 concurrent RTSP streams. Before cuframes, the setup (3 RTSP connections per camera) put each camera at **60-75%** of its RTSP server capacity. After: **20-25%**, which leaves headroom for 2-3 additional consumers without replacing hardware.

### What is **preserved** (important)

- **Recording quality**: the record path via `cuframes_packets://` is a **passthrough** (`-c:v copy`), the bit-exact original encoded stream from the camera. Frigate writes mp4 at the original full resolution, without re-encoding.
- **Latency**: <2 ms publisher → consumer (cuframes IPC) vs ~50-80 ms RTSP setup latency for each new consumer.
- **Backward compatibility**: v0.2 publishers accept v1 subscribers (frames-only), which allows a rolling upgrade.

### Hardware-agnostic projection (for a different setup)

| If you have | Expected reduction |
|---|---|
| 16 cameras × 2 consumers | 32 → 16 NVDEC (−50%), 32 → 16 RTSP (−50%) |
| 8 cameras × 3 consumers | 24 → 8 NVDEC (−67%), 24 → 8 RTSP (−67%) |
| 4 cameras × 4 consumers (multi-AI pipeline) | 16 → 4 NVDEC (−75%), 16 → 4 RTSP (−75%) |

The number of sessions saved grows **linearly** with N (consumers per camera): cameras × (N − 1). The relative reduction is 1 − 1/N. v0.1 (frames only) saves NVDEC sessions; v0.2 (frames + packets) **additionally** saves RTSP connections for record/mux consumers.

### What is **not** saved (to be honest)

- **Disk space**: recordings remain full-resolution H.265 mp4. cuframes does not compress.
- **Detector inference latency**: the ONNX/TensorRT detector runs on decoded frames regardless of the source. cuframes only changes where the decode happens.
- **Camera RTSP server CPU**: the camera still encodes the video itself. cuframes reduces **consumer-side** load, not producer-side load.
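
The rows of the hardware-agnostic projection table above follow from simple counting: without sharing, every consumer opens its own session per camera; with one publisher per camera, only the publishers do. The sketch below reproduces that arithmetic. The function name `project_sessions` is hypothetical and used for illustration only, and it assumes every camera has the same number of consumers.

```python
def project_sessions(cameras: int, consumers_per_camera: int) -> tuple[int, int, float]:
    """Return (sessions_without_sharing, sessions_with_sharing, relative_reduction).

    Assumes one decode/RTSP session per (camera, consumer) pair without sharing,
    and exactly one session per camera (the publisher) with sharing.
    """
    without = cameras * consumers_per_camera
    shared = cameras
    reduction = 1 - shared / without  # equals 1 - 1/N for N consumers per camera
    return without, shared, reduction


if __name__ == "__main__":
    # The three scenarios from the projection table.
    for cams, n in [(16, 2), (8, 3), (4, 4)]:
        without, shared, reduction = project_sessions(cams, n)
        print(f"{cams} cameras x {n} consumers: {without} -> {shared} "
              f"(-{reduction * 100:.0f}%)")
```

The same counting applies to both NVDEC sessions and RTSP connections, as long as every consumer goes through the publisher.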