gx 52fb2ad722 benchmarks: actual measured VRAM + network bandwidth (tcpdump-based)

VRAM breakdown (nvidia-smi pmon):
- 4 publishers = 4.4 GB (FHD + 2688x1520 ring buffers + NVDEC)
- cctv-backend = 1.0 GB
- frigate embeddings_manager = 1.6 GB
- frigate detector:onnx = 0.6 GB
- Total cuframes-stack = ~7.7 GB

Network (10-sec tcpdump capture from the camera subnet to R9):
- Measured: 31.5 Mbps (everything, including go2rtc on-demand and ONVIF)
- cuframes core: ~16 Mbps (4 publishers × main HEVC)
- ONVIF/RTSP keepalives: ~1-2 Mbps
- Without cuframes, the same 4 cam × 3 consumer setup would use ~45-50 Mbps

Source: production deploy 2026-05-19 measurement.

2026-05-19 19:22:53 +01:00

Benchmarks

All measurements were taken on the reference hardware (see below). The numbers are repeatable and can be independently reproduced with libcuframes/tests/test_stress.cu and tools/cuframes-rtsp-source + examples/sub_count.

Reference hardware

Component	Value
GPU	NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB VRAM, 3× NVDEC, 1× NVENC)
CPU	AMD Ryzen 9 7950X (16 cores, 32 threads)
RAM	64 GB DDR5-6000
OS	Ubuntu 24.04 (kernel 6.17, glibc 2.39)
CUDA	Driver 555+, Toolkit 12.x / 13.x
PCIe	Gen5 ×16 (GPU connection)

Stress test — 1×publisher × 4×consumer × 2000 frames

Run: libcuframes/tests/test_stress.cu (fork-based), with 1 publisher + 4 consumers (2 fast + 2 slow with a 5 ms sleep) on 1280×720 NV12 frames at a ~120 fps target.

Metric	Value
Frames per consumer	2000 / 2000
Gaps (lost seq)	0 on all 4 consumers
Torn frames (checked by the `verify_y` kernel)	0 on all 4 consumers
Wall time	18.8 s
Effective publisher rate	~106 fps (below the ~120 fps target because of the slow consumers)
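
The two derived rows above (gaps and effective rate) can be illustrated with a minimal sketch. It assumes each consumer records the sequence number of every frame it receives; the helper names `count_gaps` and `effective_fps` are hypothetical and are not taken from test_stress.cu.

```python
def count_gaps(seqs):
    """Count sequence numbers missing between consecutive received frames."""
    missing = 0
    for prev, cur in zip(seqs, seqs[1:]):
        if cur > prev + 1:
            missing += cur - prev - 1
    return missing


def effective_fps(frames, wall_seconds):
    """Average frame rate over the whole run."""
    return frames / wall_seconds


received = list(range(2000))                 # a consumer that saw every frame
print(count_gaps(received))                  # 0
print(count_gaps([0, 1, 4, 5]))              # 2 (frames 2 and 3 are missing)
print(round(effective_fps(2000, 18.8), 1))   # 106.4
```

With 2000 frames in 18.8 s, the average works out to roughly 106 fps, which matches the table.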

E2E real camera — 1 publisher + 1 consumer

Camera: Dahua HFW3441 main-stream 1920×1080 HEVC 25 fps, sub-stream 640×480 25 fps. Publisher: cuframes-rtsp-source (host build), consumer: examples/sub_count or test_cuframes_source (cctv-processor's CuframesSource).

Metric	NV12 1920×1080	NV12 640×480
Frame size (packed)	3,110,400 bytes (~3 MB)	460,800 bytes
Effective bandwidth	75 MB/s	11 MB/s
Publisher decode rate	25.03 fps (matches camera)	25.00 fps
Consumer receive rate	25.03 fps	25.34 fps
100-frame test	0 drops, 0 gaps	0 drops, 0 gaps

Production: 1× publisher → N× consumers (Frigate + cctv-backend)

Actual production setup (24+ hours of uptime):

Publisher: cuframes-pub-parking (Dahua 192.168.88.98 sub-stream, 640×480 HEVC 25 fps)
Consumer 1: Frigate 0.17.1 via the FFmpeg cuframes:// demuxer (detect path; ONNX object detection)
Consumer 2: cctv-backend via the C++ CuframesSource (motion detect + grid composer + RTSP encode → TV)

Metric	Value
Total NVDEC operations	1 (publisher only)
Without cuframes, it would be	2 (Frigate detect + cctv-backend detect)
GPU encoder	1× (cctv-backend H.264 encode for RTSP output)
Publisher VRAM ring	6 buffers × 460 KB ≈ 2.8 MB (sub-stream)
Frigate detect drops	0 over 24h
cctv-backend frame loss	0 over 24h

VRAM cost — NV12 ring buffer

Ring size = frame_size × ring_size. NV12 frame size = width × height × 1.5 bytes.

Resolution	Frame size	Ring 6 buffers
640×480	460 KB	2.8 MB
1280×720	1.38 MB	8.3 MB
1920×1080 (FHD)	3 MB	18 MB
2560×1440	5.5 MB	33 MB
2688×1520 (Dahua 4MP)	6 MB	36 MB
3840×2160 (4K)	12 MB	72 MB

For a 16-camera setup on an RTX 5090 (32 GB VRAM), with all cameras FHD and ring=6, the total is ~288 MB (16 × 18 MB), which is under 1% of the available VRAM.
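
The ring-size formula can be checked with a short sketch. The function names are illustrative, and decimal megabytes (1 MB = 10^6 bytes) are an assumption about the units used in the table.

```python
def nv12_frame_bytes(width, height):
    """NV12: a full-resolution Y plane plus a half-size interleaved UV plane."""
    return width * height * 3 // 2


def ring_bytes(width, height, ring_size=6):
    """VRAM taken by one publisher's ring of NV12 frames."""
    return nv12_frame_bytes(width, height) * ring_size


MB = 1_000_000  # assumption: decimal megabytes

for w, h in [(640, 480), (1920, 1080), (3840, 2160)]:
    print(f"{w}x{h}: frame {nv12_frame_bytes(w, h) / MB:.2f} MB, "
          f"ring {ring_bytes(w, h) / MB:.1f} MB")

# 16 FHD cameras, ring=6, against 32 GB of VRAM
fleet = 16 * ring_bytes(1920, 1080)
print(f"16 FHD cameras: {fleet / MB:.0f} MB, "
      f"{fleet / (32_000 * MB):.2%} of 32 GB")
```

Without rounding the FHD frame down to 3 MB, the 16-camera total comes out slightly higher (about 299 MB), which is still below 1% of 32 GB.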

Comparison: cuframes vs traditional N×NVDEC

Scenario: 16 cameras × 25 fps × 3 consumers (Frigate, cctv-processor, AI-pipeline).

Approach	NVDEC ops/sec	VRAM bandwidth (decoded path)
Without cuframes	16 × 25 × 3 = 1200	≥ 1200 × 6 MB = 7.2 GB/s
With cuframes (v0.1)	16 × 25 × 1 = 400	≥ 16 × 25 × 6 MB = 2.4 GB/s
Savings	3× fewer NVDEC ops	3× less memory bandwidth

NVDEC throughput limit on the RTX 5090: ~50 concurrent FHD25 streams. Without cuframes, 3 consumers × 16 cameras = 48 streams, or ~96% of capacity, which is effectively saturated. With cuframes, 16 streams use ~32%, leaving headroom to scale.
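
The arithmetic behind this comparison can be sketched as follows. The ~50-stream capacity is the approximate figure quoted above, not a vendor specification, and `decode_load` is a hypothetical helper.

```python
NVDEC_FHD25_CAPACITY = 50  # approximate concurrent FHD25 streams, from the text


def decode_load(cameras, consumers, fps=25, shared=False):
    """Return (decode sessions, decode ops per second, share of capacity).

    shared=False: every consumer decodes every camera itself.
    shared=True:  one publisher per camera decodes once for all consumers.
    """
    sessions = cameras if shared else cameras * consumers
    return sessions, sessions * fps, sessions / NVDEC_FHD25_CAPACITY


for label, shared in [("without cuframes", False), ("with cuframes", True)]:
    sessions, ops, load = decode_load(16, 3, shared=shared)
    print(f"{label}: {sessions} sessions, {ops} ops/s, {load:.0%} of NVDEC")
```

For 16 cameras and 3 consumers this gives 48 sessions (1200 ops/s, 96%) without sharing and 16 sessions (400 ops/s, 32%) with it.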

Latency

Hop	Latency
RTSP → publisher demuxer	sub-frame (<40 ms FHD25)
NVDEC decode	~3-5 ms per frame
publish_external → consumer receive	<0.5 ms (cudaEventRecord → cudaStreamWaitEvent)
consumer cudaMemcpy NV12 → host (FFmpeg demuxer v1)	~2-3 ms FHD
End-to-end RTSP → consumer frame ready	~50-100 ms typical

The zero-copy path (via AVHWFramesContext, planned for v0.2) will remove the CPU copy, ideally bringing end-to-end latency below 10 ms.

Reproducibility

All benchmarks can be reproduced from the repo:

# Stress test
cd build && cmake -DBUILD_TESTING=ON ..  && cmake --build . && ctest -R stress -V

# E2E single consumer
./tools/cuframes-rtsp-source --rtsp rtsp://... --key cam1 --ring 6 --verbose &
./examples/sub_count --key cam1 --max-frames 100 --verbose

For production deployment measurements, see the integration guides:

docs/integration.md — cctv-processor C++ pipeline
filter/README.md — FFmpeg demuxer (Frigate setup)

Real-world production deployment (2026-05-19, v0.2.0)

Setup: 4 Dahua IP cameras (HEVC main 1920×1080 / 2688×1520, 25 fps) → 3 simultaneous consumers on a single RTX 5090 host:

Frigate detect (ONNX D-FINE-S, 640×480) + record (full-res H.265 mp4)
cctv-backend custom C++ mosaic processor (composes 4×grid → RTSP output for the TV)

Before → after (measured in production, identical workload)

Metric	Without cuframes	With cuframes v0.2 dual-input	Reduction
RTSP connections to cameras	12 (4 cam × 3 consumer)	4 (publishers only)	−67%
NVDEC sessions	~8 (one decode per consumer)	4 (publishers only)	−50%
Camera-side bandwidth	~34 Mbps (main+main+sub per cam)	~16 Mbps (main per cam)	−54%
PCIe D2H copies (consumer side)	~346 MB/s (decoded frames → host)	~0 (zero-copy CUDA IPC)	−100%
Frigate ffmpeg processes with direct RTSP	8 (detect+record × 4)	0 (all via cuframes)	−100%

Live nvidia-smi metrics on the running system

GPU SM:     4-5%   (compute: detector + cuframes consumers)
GPU NVDEC:  2-4%   (without cuframes, 15-25% would be expected)
GPU NVENC:  0-1%

VRAM breakdown (measured)

Component	VRAM
4× cuframes publishers (3× FHD ring + 1× 2688×1520 for LPR)	4.4 GB
cctv-backend (composer + grid output)	1.0 GB
frigate.embeddings_manager (face + LPR ONNX models)	1.6 GB
frigate.detector:onnx (D-FINE-S COCO)	0.6 GB
Total cuframes-stack VRAM	~7.7 GB

Of this, only the 4.4 GB in the publishers (ring buffers + NVDEC decode buffers) is attributable to cuframes itself. The consumers (Frigate, cctv-backend) hold their own CUDA contexts independently.

Network bandwidth (real tcpdump, 10-sec sample)

31.5 Mbps from the camera subnet (4 cameras → R9), measured with tcpdump -w cam-traffic.pcap over 10 seconds.

Approximate breakdown:

4 publishers × main HEVC RTP/UDP: ~16 Mbps (cuframes core)
go2rtc on-demand streams (Frigate UI live preview, if open): 0-10 Mbps
ONVIF discovery, RTSP keepalives, NTP-from-cameras: ~1-2 Mbps

Without cuframes, the same setup (cctv-backend + Frigate detect + Frigate record × 4 cameras) would produce ~45-50 Mbps, mainly because the record path would pull a separate main stream from each camera.
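
For reference, a capture is turned into an average bitrate as sketched below. The byte count is a hypothetical value chosen to correspond to 31.5 Mbps over 10 s, not the actual size of cam-traffic.pcap, and `average_mbps` is an illustrative helper.

```python
def average_mbps(total_bytes, duration_s):
    """Average bitrate in megabits per second (1 Mbps = 10**6 bit/s)."""
    return total_bytes * 8 / duration_s / 1e6


# Hypothetical: about 39.4 MB of packet data captured in a 10-second window.
captured_bytes = 39_375_000
print(average_mbps(captured_bytes, 10.0))  # 31.5
```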

Camera-side benefits

Dahua/Hikvision cameras are usually capped at 4-5 simultaneous RTSP streams. Before cuframes, the setup (4 cameras × 3 RTSP streams) put each camera at 60-75% of its RTSP server capacity. Afterwards it is 20-25%, leaving headroom for 2-3 additional consumers without replacing hardware.

What is preserved (important)

Recording quality: the record path via cuframes_packets:// is a passthrough (-c:v copy), the bit-exact original encoded stream from the camera. Frigate writes mp4 at the original full resolution, without re-encoding.
Latency: <2 ms publisher → consumer (cuframes IPC) vs ~50-80 ms RTSP setup latency for each new consumer.
Backward compatibility: v0.2 publishers accept v1 subscribers (frames-only), which allows a rolling upgrade.

Hardware-agnostic projection (for a different setup)

If you have	Expected reduction
16 cameras × 2 consumers	32 → 16 NVDEC (−50%), 32 → 16 RTSP (−50%)
8 cameras × 3 consumers	24 → 8 NVDEC (−67%), 24 → 8 RTSP (−67%)
4 cameras × 4 consumers (multi-AI pipeline)	16 → 4 NVDEC (−75%), 16 → 4 RTSP (−75%)

The reduction scales with N (consumers per camera): the session count drops N-fold, a relative saving of 1 − 1/N. v0.1 (frames only) saves NVDEC sessions; v0.2 (frames + packets) additionally saves RTSP connections for record/mux consumers.
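
The projection table follows from one publisher per camera replacing one session per consumer. A minimal sketch, with `projection` as a hypothetical helper:

```python
def projection(cameras, consumers_per_camera):
    """Return (sessions before, sessions after, relative reduction)."""
    before = cameras * consumers_per_camera  # one decode/RTSP session per consumer
    after = cameras                          # one publisher per camera
    reduction = 1 - after / before           # equals 1 - 1/N
    return before, after, reduction


for cams, n in [(16, 2), (8, 3), (4, 4)]:
    before, after, red = projection(cams, n)
    print(f"{cams} cameras x {n} consumers: {before} -> {after} ({red:.0%} fewer)")
```

The three rows of the table correspond to N = 2, 3 and 4, giving reductions of 50%, 67% and 75%.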

What is NOT saved (to be honest)

Disk space: recordings remain full-resolution H.265 mp4. Cuframes does not compress.
Detector inference latency: the ONNX/TensorRT detector runs on decoded frames regardless of the source. Cuframes only changes where the decode happens.
Camera RTSP server CPU: the camera still encodes the video itself. Cuframes reduces consumer-side load, not producer-side.
