Benchmarks
All measurements were taken on the reference hardware (see below). The numbers are
repeatable and can be reproduced with libcuframes/tests/test_stress.cu and
tools/cuframes-rtsp-source + examples/sub_count.
Reference hardware
| Component | Value |
|---|---|
| GPU | NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB VRAM, 3× NVDEC, 1× NVENC) |
| CPU | AMD Ryzen 9 7950X (16 cores, 32 threads) |
| RAM | 64 GB DDR5-6000 |
| OS | Ubuntu 24.04 (kernel 6.17, glibc 2.39) |
| CUDA | Driver 555+, Toolkit 12.x / 13.x |
| PCIe | Gen5 ×16 (GPU connection) |
Stress test — 1×publisher × 4×consumer × 2000 frames
Run: libcuframes/tests/test_stress.cu (fork-based, 1 publisher + 4 consumers:
2 fast + 2 slow with a 5 ms sleep) on 1280×720 NV12 frames at a ~120 fps target.
| Metric | Value |
|---|---|
| Frames per consumer | 2000 / 2000 |
| Gaps (lost seq) | 0 for all 4 consumers |
| Torn frames (checked by the verify_y kernel) | 0 for all 4 consumers |
| Wall time | 18.8 s |
| Effective publisher rate | ~106 fps (below the ~120 fps target because of the slow consumers) |
E2E real camera — 1 publisher + 1 consumer
Camera: Dahua HFW3441 main-stream 1920×1080 HEVC 25 fps, sub-stream 640×480 25 fps.
Publisher: cuframes-rtsp-source (host build); consumer: examples/sub_count or
test_cuframes_source (cctv-processor's CuframesSource).
| Metric | NV12 1920×1080 | NV12 640×480 |
|---|---|---|
| Frame size (packed) | 3,110,400 bytes (~3 MB) | 460,800 bytes |
| Effective bandwidth | 75 MB/s | 11 MB/s |
| Publisher decode rate | 25.03 fps (matches camera) | 25.00 fps |
| Consumer receive rate | 25.03 fps | 25.34 fps |
| 100-frame test | 0 drops, 0 gaps | 0 drops, 0 gaps |
Production: 1× publisher → N× consumers (Frigate + cctv-backend)
Real production setup (24+ hours of uptime):
- Publisher: cuframes-pub-parking, Dahua 192.168.88.98 sub-stream 640×480 HEVC 25 fps
- Consumer 1: Frigate 0.17.1 via the FFmpeg cuframes:// demuxer (detect path; ONNX object detection)
- Consumer 2: cctv-backend via the C++ CuframesSource (motion detect + grid composer + RTSP encode → TV)
| Metric | Value |
|---|---|
| Total NVDEC operations | 1 (publisher only) |
| Without cuframes it would be | 2 (Frigate detect + cctv-backend detect) |
| GPU encoder | 1× (cctv-backend H.264 encode для RTSP output) |
| Publisher VRAM ring | 6 buffers × 460 KB ≈ 2.8 MB (sub-stream) |
| Frigate detect drops | 0 over 24h |
| cctv-backend frame loss | 0 over 24h |
VRAM cost — NV12 ring buffer
Ring size = frame_size × ring_size. NV12 frame size = width × height × 1.5.
| Resolution | Frame size | Ring 6 buffers |
|---|---|---|
| 640×480 | 460 KB | 2.8 MB |
| 1280×720 | 1.38 MB | 8.3 MB |
| 1920×1080 (FHD) | 3 MB | 18 MB |
| 2560×1440 | 5.4 MB | 33 MB |
| 2688×1520 (Dahua 4MP) | 6 MB | 36 MB |
| 3840×2160 (4K) | 12 MB | 72 MB |
For a 16-camera setup on an RTX 5090 (32 GB VRAM), with every camera at FHD and ring=6, the rings total ~288 MB (16 × 18 MB). That is under 1% of the available VRAM.
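The ring-size formula above can be checked with a few lines of Python. This is a minimal sketch, not part of cuframes: the function names are hypothetical, and it assumes tightly packed NV12 with no pitch alignment or decoder-internal surfaces, so real allocations may be somewhat larger. Exact byte counts come out slightly above the rounded table values.

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """Packed NV12: full-resolution Y plane plus a half-size interleaved UV plane."""
    return width * height * 3 // 2


def ring_bytes(width: int, height: int, ring_size: int = 6) -> int:
    """VRAM for one publisher's ring (assumption: no pitch padding)."""
    return nv12_frame_bytes(width, height) * ring_size


if __name__ == "__main__":
    resolutions = [(640, 480), (1280, 720), (1920, 1080),
                   (2560, 1440), (2688, 1520), (3840, 2160)]
    for w, h in resolutions:
        frame_mb = nv12_frame_bytes(w, h) / 1e6
        ring_mb = ring_bytes(w, h) / 1e6
        print(f"{w}x{h}: frame {frame_mb:.2f} MB, ring of 6: {ring_mb:.1f} MB")

    # 16 FHD cameras, one ring of 6 buffers each.
    total_mb = 16 * ring_bytes(1920, 1080) / 1e6
    print(f"16 FHD cameras: {total_mb:.0f} MB")
```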
Comparison: cuframes vs traditional N×NVDEC
Scenario: 16 cameras × 25 fps × 3 consumers (Frigate, cctv-processor, AI pipeline).
| Approach | NVDEC ops/sec | VRAM bandwidth (decoded path) |
|---|---|---|
| Without cuframes | 16 × 25 × 3 = 1200 | ≥ 1200 × 6 MB = 7.2 GB/s |
| With cuframes (v0.1) | 16 × 25 × 1 = 400 | ≥ 16 × 25 × 6 MB = 2.4 GB/s |
| Savings | 3× fewer NVDEC ops | 3× less memory bandwidth |
NVDEC throughput limit on the RTX 5090: ~50 concurrent FHD25 streams. Without cuframes, 3 consumers × 16 cameras = 48 streams, which takes ~96% of capacity and saturates the decoder. With cuframes, 16 streams take ~32%, leaving headroom to scale.
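The arithmetic behind the comparison can be reproduced as follows. This is an illustrative sketch with hypothetical function names; the 6 MB per decoded frame and the ~50-stream NVDEC capacity are taken from the text above as given, not measured by this code.

```python
def decode_ops_per_sec(cameras: int, fps: int, decoders_per_camera: int) -> int:
    """Frames decoded per second across all NVDEC sessions."""
    return cameras * fps * decoders_per_camera


def decoded_bandwidth_gb_s(ops_per_sec: int, frame_mb: float = 6.0) -> float:
    """Lower bound on decoded-frame memory traffic, in GB/s."""
    return ops_per_sec * frame_mb / 1000.0


def nvdec_utilization(cameras: int, decoders_per_camera: int,
                      capacity_streams: int = 50) -> float:
    """Fraction of the assumed concurrent-stream capacity in use."""
    return cameras * decoders_per_camera / capacity_streams


if __name__ == "__main__":
    for label, decoders in [("without cuframes", 3), ("with cuframes", 1)]:
        ops = decode_ops_per_sec(16, 25, decoders)
        bw = decoded_bandwidth_gb_s(ops)
        util = nvdec_utilization(16, decoders)
        print(f"{label}: {ops} ops/s, >= {bw:.1f} GB/s, {util:.0%} of NVDEC capacity")
```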
Latency
| Hop | Latency |
|---|---|
| RTSP → publisher demuxer | sub-frame (<40 ms FHD25) |
| NVDEC decode | ~3-5 ms per frame |
| publish_external → consumer receive | <0.5 ms (cudaEventRecord → cudaStreamWaitEvent) |
| consumer cudaMemcpy NV12 → host (FFmpeg demuxer v1) | ~2-3 ms FHD |
| End-to-end RTSP → consumer frame ready | ~50-100 ms typical |
The zero-copy path (via AVHWFramesContext, planned for v0.2) will remove the CPU copy:
ideally <10 ms end-to-end.
Reproducibility
All benchmarks can be reproduced from the repo:
# Stress test
cd build && cmake -DBUILD_TESTING=ON .. && cmake --build . && ctest -R stress -V
# E2E single consumer
./tools/cuframes-rtsp-source --rtsp rtsp://... --key cam1 --ring 6 --verbose &
./examples/sub_count --key cam1 --max-frames 100 --verbose
For production deployment measurements, see the integration guides:
- docs/integration.md — cctv-processor C++ pipeline
- filter/README.md — FFmpeg demuxer (Frigate setup)
Real-world production deployment (2026-05-19, v0.2.0)
Setup: 4 Dahua IP cameras (HEVC main 1920×1080 / 2688×1520, 25 fps) → 3 concurrent consumers on a single RTX 5090 host (Frigate detect and Frigate record count as two):
- Frigate detect (ONNX D-FINE-S, 640×480) + record (full-res H.265 mp4)
- cctv-backend custom C++ mosaic processor (composes a 4× grid → RTSP output for the TV)
Before → after (measured in production, identical workload)
| Metric | Without cuframes | With cuframes v0.2 dual-input | Reduction |
|---|---|---|---|
| RTSP connections to cameras | 12 (4 cam × 3 consumers) | 4 (publishers only) | −67% |
| NVDEC sessions | ~8 (one decode per decoding consumer) | 4 (publishers only) | −50% |
| Camera-side bandwidth | ~34 Mbps (main+main+sub per cam) | ~16 Mbps (main per cam) | −54% |
| PCIe D2H copies (consumer side) | ~346 MB/s (decoded frames → host) | ~0 (zero-copy CUDA IPC) | −100% |
| Frigate ffmpeg processes with direct RTSP | 8 (detect+record × 4) | 0 (all via cuframes) | −100% |
Live nvidia-smi metrics in the running system
GPU SM: 4-5% (compute: detector + cuframes consumers)
GPU NVDEC: 2-4% (without cuframes, 15-25% would be expected)
GPU NVENC: 0-1%
VRAM breakdown (measured)
| Component | VRAM |
|---|---|
| 4× cuframes publishers (3× FHD ring + 1× 2688×1520 for LPR) | 4.4 GB |
| cctv-backend (composer + grid output) | 1.0 GB |
| frigate.embeddings_manager (face + LPR ONNX models) | 1.6 GB |
| frigate.detector:onnx (D-FINE-S COCO) | 0.6 GB |
| Total cuframes-stack VRAM | ~7.7 GB |
Of this, only the 4.4 GB in the publishers (ring buffers + NVDEC decode buffers) is attributable to cuframes itself. The consumers (Frigate, cctv-backend) hold their own CUDA contexts independently.
Network bandwidth (real tcpdump, 10-sec sample)
31.5 Mbps from the camera subnet (4 cameras → R9), measured with
tcpdump -w cam-traffic.pcap over 10 seconds.
Approximate breakdown:
- 4 publishers × main HEVC RTP/UDP: ~16 Mbps (cuframes core)
- go2rtc on-demand streams (Frigate UI live preview, if open): 0-10 Mbps
- ONVIF discovery, RTSP keepalives, NTP-from-cameras: ~1-2 Mbps
Without cuframes, the same setup (cctv-backend + Frigate detect + Frigate record × 4 cameras) would have produced ~45-50 Mbps, mainly because the record path pulled a separate main stream from each camera.
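One way to turn a capture like cam-traffic.pcap into an average Mbps figure is to sum the packet lengths and divide by the capture duration. The sketch below is not the tool used for the measurement above. It is a hypothetical helper that assumes the classic pcap format (what tcpdump -w writes by default, not pcapng) and counts on-the-wire lengths, link-layer headers included.

```python
import struct

PCAP_GLOBAL_HEADER_LEN = 24   # magic, version, tz, sigfigs, snaplen, linktype
PCAP_RECORD_HEADER_LEN = 16   # ts_sec, ts_frac, incl_len, orig_len

# Magic number as it appears on disk -> (struct byte order, timestamp divisor)
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", 1e6),  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": (">", 1e6),  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": ("<", 1e9),  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": (">", 1e9),  # big-endian, nanosecond timestamps
}


def pcap_mbps(data: bytes) -> float:
    """Average throughput of a classic pcap capture, in Mbit/s."""
    if data[:4] not in _MAGICS:
        raise ValueError("not a classic pcap file")
    endian, frac_div = _MAGICS[data[:4]]

    offset = PCAP_GLOBAL_HEADER_LEN
    total_bytes = 0
    first_ts = last_ts = None
    while offset + PCAP_RECORD_HEADER_LEN <= len(data):
        ts_sec, ts_frac, incl_len, orig_len = struct.unpack_from(
            endian + "IIII", data, offset)
        ts = ts_sec + ts_frac / frac_div
        if first_ts is None:
            first_ts = ts
        last_ts = ts
        total_bytes += orig_len          # length on the wire, not the snapped length
        offset += PCAP_RECORD_HEADER_LEN + incl_len

    if first_ts is None or last_ts <= first_ts:
        raise ValueError("need at least two packets with increasing timestamps")
    return total_bytes * 8 / (last_ts - first_ts) / 1e6
```

Usage would look like `pcap_mbps(open("cam-traffic.pcap", "rb").read())`. Splitting the total by source (publishers vs go2rtc vs ONVIF) would additionally require parsing the IP/UDP headers, which this sketch does not attempt.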
Camera-side benefits
Dahua/Hikvision cameras are usually capped at 4-5 simultaneous RTSP streams. Before cuframes, this setup (3 RTSP sessions per camera across 4 cameras) put each camera at 60-75% of its RTSP server's capacity. After cuframes that drops to 20-25%, leaving headroom for 2-3 additional consumers without replacing hardware.
What is preserved (important)
- Recording quality: the record path via cuframes_packets:// is a passthrough (-c:v copy), the bit-exact original encoded stream from the camera. Frigate writes mp4 at the original full resolution, without re-encoding.
- Latency: <2 ms publisher → consumer (cuframes IPC) vs ~50-80 ms RTSP setup latency for each new consumer.
- Backward compatibility: v0.2 publishers accept v1 subscribers (frames-only), which allows a rolling upgrade.
Hardware-agnostic projection (for other setups)
| If you have | Expected reduction |
|---|---|
| 16 cameras × 2 consumers | 32 → 16 NVDEC (−50%), 32 → 16 RTSP (−50%) |
| 8 cameras × 3 consumers | 24 → 8 NVDEC (−67%), 24 → 8 RTSP (−67%) |
| 4 cameras × 4 consumers (multi-AI pipeline) | 16 → 4 NVDEC (−75%), 16 → 4 RTSP (−75%) |
The number of sessions saved grows linearly with N (consumers per camera): cameras × (N − 1), for a relative reduction of 1 − 1/N. v0.1 (frames only) saves NVDEC sessions; v0.2 (frames + packets) additionally saves RTSP connections for record/mux consumers.
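The projection table follows from a simple count, sketched below. The function name is hypothetical, and the model assumes every consumer would otherwise open its own NVDEC session and RTSP connection per camera, which overstates the NVDEC savings for consumers that only copy packets (such as a record path).

```python
def project_sessions(cameras: int, consumers_per_camera: int) -> tuple[int, int, float]:
    """Return (sessions before, sessions after, relative reduction).

    Before: one session per camera per consumer.
    After:  one session per camera (the publisher), shared by all consumers.
    """
    before = cameras * consumers_per_camera
    after = cameras
    return before, after, 1 - after / before


if __name__ == "__main__":
    for cams, consumers in [(16, 2), (8, 3), (4, 4)]:
        before, after, reduction = project_sessions(cams, consumers)
        print(f"{cams} cameras x {consumers} consumers: "
              f"{before} -> {after} sessions (-{reduction:.0%})")
```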
What is NOT saved (to be honest)
- Disk space: recordings remain full-resolution H.265 mp4. Cuframes does not compress.
- Detector inference latency: the ONNX/TensorRT detector runs on decoded frames regardless of the source. Cuframes only changes where the decode happens.
- Camera RTSP server CPU: the camera still encodes the video itself. Cuframes reduces consumer-side load, not producer-side load.