gx a3ba3a95b2 docs: ROADMAP + CHANGELOG v0.1.0 + BENCHMARKS
2026-05-18 21:11:37 +01:00


Benchmarks

All measurements were taken on the reference hardware (see below). The numbers are repeatable and can be reproduced with libcuframes/tests/test_stress.cu and with tools/cuframes-rtsp-source + examples/sub_count.

Reference hardware

| Component | Value |
|---|---|
| GPU | NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB VRAM, 3× NVDEC, 1× NVENC) |
| CPU | AMD Ryzen 9 7950X (16 cores, 32 threads) |
| RAM | 64 GB DDR5-6000 |
| OS | Ubuntu 24.04 (kernel 6.17, glibc 2.39) |
| CUDA | Driver 555+, Toolkit 12.x / 13.x |
| PCIe | Gen5 ×16 (GPU connection) |

Stress test — 1×publisher × 4×consumer × 2000 frames

Run: libcuframes/tests/test_stress.cu, a fork-based test with 1 publisher + 4 consumers (2 fast + 2 slow with a 5 ms sleep) on 1280×720 NV12 frames at a ~120 fps target.

| Metric | Value |
|---|---|
| Frames per consumer | 2000 / 2000 |
| Gaps (lost seq) | 0 for all 4 consumers |
| Torn frames (checked by the verify_y kernel) | 0 for all 4 consumers |
| Wall time | 18.8 s |
| Effective publisher rate | ~106 fps (below the ~120 fps target because of the slow consumers) |
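
The two derived figures in this table (effective rate and gap count) can be checked with a few lines of arithmetic. The sketch below is illustrative only: `effective_rate` and `count_gaps` are hypothetical helper names, not part of test_stress.cu, and it assumes each consumer records the sequence number of every frame it receives.

```python
def effective_rate(frames: int, wall_seconds: float) -> float:
    """Frames published per second of wall time."""
    return frames / wall_seconds


def count_gaps(seqs: list[int]) -> int:
    """Count sequence numbers missing between consecutive received frames."""
    gaps = 0
    for prev, cur in zip(seqs, seqs[1:]):
        if cur > prev + 1:
            gaps += cur - prev - 1
    return gaps


# Numbers from the table: 2000 frames in 18.8 s of wall time.
print(f"{effective_rate(2000, 18.8):.1f} fps")

# A consumer that saw every frame reports zero gaps.
print(count_gaps(list(range(2000))))
```

A received sequence such as `[0, 1, 4, 5]` would count as 2 gaps (frames 2 and 3 missing), which is the condition the "Gaps (lost seq)" row reports as absent.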

E2E real camera — 1 publisher + 1 consumer

Camera: Dahua HFW3441, main stream 1920×1080 HEVC 25 fps, sub-stream 640×480 25 fps. Publisher: cuframes-rtsp-source (host build). Consumer: examples/sub_count or test_cuframes_source (cctv-processor's CuframesSource).

| Metric | NV12 1920×1080 | NV12 640×480 |
|---|---|---|
| Frame size (packed) | 3,110,400 bytes (~3.1 MB) | 460,800 bytes |
| Effective bandwidth (frame size × 25 fps) | ~78 MB/s | ~11.5 MB/s |
| Publisher decode rate | 25.03 fps (matches camera) | 25.00 fps |
| Consumer receive rate | 25.03 fps | 25.34 fps |
| 100-frame test | 0 drops, 0 gaps | 0 drops, 0 gaps |
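
The frame size and bandwidth rows follow directly from the NV12 layout (a full-resolution Y plane plus a half-resolution interleaved UV plane, so 1.5 bytes per pixel). A minimal sketch of that arithmetic, with hypothetical function names and sizes expressed in decimal MB (10^6 bytes), so rounded figures quoted elsewhere may differ slightly:

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """Packed NV12 frame: Y plane (w*h) + interleaved UV plane (w*h/2)."""
    return width * height * 3 // 2


def bandwidth_mb_per_s(width: int, height: int, fps: float) -> float:
    """Decoded-frame bandwidth in decimal MB/s for one stream."""
    return nv12_frame_bytes(width, height) * fps / 1e6


for w, h in [(1920, 1080), (640, 480)]:
    size = nv12_frame_bytes(w, h)
    bw = bandwidth_mb_per_s(w, h, 25)
    print(f"{w}x{h}: {size} bytes/frame, {bw:.2f} MB/s at 25 fps")
```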

Production: 1× publisher → N× consumers (Frigate + cctv-backend)

Real production setup (24+ hours of uptime):

  • Publisher: cuframes-pub-parking, Dahua 192.168.88.98 sub-stream 640×480 HEVC 25 fps
  • Consumer 1: Frigate 0.17.1 via the FFmpeg cuframes:// demuxer (detect path; ONNX object detection)
  • Consumer 2: cctv-backend via the C++ CuframesSource (motion detect + grid composer + RTSP encode → TV)

| Metric | Value |
|---|---|
| Total NVDEC operations | 1 (publisher only); without cuframes it would be 2 (Frigate detect + cctv-backend detect) |
| GPU encoder | 1× (cctv-backend H.264 encode for RTSP output) |
| Publisher VRAM ring | 6 buffers × 460 KB ≈ 2.8 MB (sub-stream) |
| Frigate detect drops | 0 over 24h |
| cctv-backend frame loss | 0 over 24h |

VRAM cost — NV12 ring buffer

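Ring size = frame_size × ring_size. NV12 frame size = width × height × 1.5 bytes. Sizes below are in decimal units (1 MB = 10^6 bytes).

| Resolution | Frame size | Ring of 6 buffers |
|---|---|---|
| 640×480 | 460.8 KB | 2.8 MB |
| 1280×720 | 1.38 MB | 8.3 MB |
| 1920×1080 (FHD) | 3.11 MB | 18.7 MB |
| 2560×1440 | 5.53 MB | 33.2 MB |
| 2688×1520 (Dahua 4MP) | 6.13 MB | 36.8 MB |
| 3840×2160 (4K) | 12.44 MB | 74.6 MB |

For a 16-camera setup on an RTX 5090 (32 GB VRAM) with every camera at FHD and ring=6, the total is 16 × 18.7 MB ≈ 299 MB, which is under 1% of the available VRAM.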
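
To estimate the ring cost for a different camera count or ring depth, the formula above can be applied directly. This is a sketch with hypothetical helper names; it uses exact byte counts and decimal MB, so its output can differ slightly from rounded figures in the text, and it takes 32 GB as 32 × 10^9 bytes.

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """NV12 uses 1.5 bytes per pixel (Y plane + half-size UV plane)."""
    return width * height * 3 // 2


def ring_bytes(width: int, height: int, ring_size: int = 6) -> int:
    """VRAM held by one publisher's ring of decoded frames."""
    return nv12_frame_bytes(width, height) * ring_size


def total_ring_bytes(cameras: int, width: int, height: int, ring_size: int = 6) -> int:
    """VRAM held by all rings when every camera has the same resolution."""
    return cameras * ring_bytes(width, height, ring_size)


VRAM_BYTES = 32e9  # assumption: 32 GB card, decimal gigabytes

total = total_ring_bytes(16, 1920, 1080)
print(f"16 FHD cameras, ring=6: {total / 1e6:.1f} MB "
      f"({100 * total / VRAM_BYTES:.2f}% of VRAM)")
```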

Comparison: cuframes vs traditional N×NVDEC

Scenario: 16 cameras × 25 fps × 3 consumers (Frigate, cctv-processor, AI-pipeline).

| Approach | NVDEC ops/sec | VRAM bandwidth (decoded path) |
|---|---|---|
| Without cuframes | 16 × 25 × 3 = 1200 | ≥ 1200 × 6 MB = 7.2 GB/s |
| With cuframes (v0.1) | 16 × 25 × 1 = 400 | ≥ 400 × 6 MB = 2.4 GB/s |
| Savings | 3× fewer NVDEC ops | 3× less memory bandwidth |

NVDEC throughput limit on the RTX 5090: ~50 concurrent FHD25 streams. Without cuframes, 3 consumers × 16 cameras = 48 decode sessions, about 96% of capacity, which is effectively saturated. With cuframes, the 16 sessions use about 32%, leaving headroom for scaling.
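
The comparison reduces to counting decode sessions: one per camera per consumer without sharing, one per camera with a shared publisher. A small sketch of that calculation follows; the function names are hypothetical, and the ~50-stream capacity is the approximate figure quoted above, not a measured constant.

```python
def decode_sessions(cameras: int, consumers: int, shared: bool) -> int:
    """NVDEC sessions needed: one per camera if decoded frames are shared."""
    return cameras if shared else cameras * consumers


def decoded_frames_per_s(sessions: int, fps: int) -> int:
    """Frames the decoder must produce per second across all sessions."""
    return sessions * fps


def nvdec_utilization(sessions: int, capacity_streams: int = 50) -> float:
    """Fraction of the assumed concurrent-stream capacity in use."""
    return sessions / capacity_streams


for label, shared in [("without cuframes", False), ("with cuframes", True)]:
    n = decode_sessions(cameras=16, consumers=3, shared=shared)
    print(f"{label}: {n} sessions, {decoded_frames_per_s(n, 25)} frames/s, "
          f"{nvdec_utilization(n):.0%} of capacity")
```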

Latency

| Hop | Latency |
|---|---|
| RTSP → publisher demuxer | sub-frame (<40 ms at FHD25) |
| NVDEC decode | ~3-5 ms per frame |
| publish_external → consumer receive | <0.5 ms (cudaEventRecord → cudaStreamWaitEvent) |
| consumer cudaMemcpy NV12 → host (FFmpeg demuxer v1) | ~2-3 ms at FHD |
| End-to-end RTSP → consumer frame ready | ~50-100 ms typical |

The zero-copy path (via AVHWFramesContext, planned for v0.2) will remove the CPU copy; in the ideal case that gives <10 ms end-to-end.
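
As a rough cross-check, the per-hop upper bounds from the hop table can be summed into a latency budget. The sketch below only adds the four itemized hops; the dictionary name and labels are illustrative, and anything outside those hops (the table does not itemize it) is not modeled.

```python
# Upper bound of each itemized hop, in milliseconds, taken from the hop table.
HOP_UPPER_MS = {
    "RTSP -> publisher demuxer": 40.0,
    "NVDEC decode": 5.0,
    "publish_external -> consumer receive": 0.5,
    "consumer cudaMemcpy NV12 -> host": 3.0,
}


def budget_ms(hops: dict[str, float]) -> float:
    """Sum of per-hop upper bounds."""
    return sum(hops.values())


print(f"Itemized hops add up to at most {budget_ms(HOP_UPPER_MS)} ms")
```

The itemized hops sum to just under the low end of the quoted ~50-100 ms typical end-to-end range.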

Reproducibility

All benchmarks can be reproduced from the repo:

# Stress test
mkdir -p build && cd build && cmake -DBUILD_TESTING=ON .. && cmake --build . && ctest -R stress -V

# E2E single consumer
./tools/cuframes-rtsp-source --rtsp rtsp://... --key cam1 --ring 6 --verbose &
./examples/sub_count --key cam1 --max-frames 100 --verbose

For production deployment measurements, see the integration guides: