gx a3ba3a95b2 docs: ROADMAP + CHANGELOG v0.1.0 + BENCHMARKS

- ROADMAP.md: structured v0.1✅ / v0.2📋 (encoded packet sharing + FFmpeg
  upstream PR + scale-cuda alt) / v0.3 (Python bindings, Jetson, multi-GPU)
  / v1.0 (stable ABI)
- CHANGELOG.md: full v0.1.0 release notes — features, tested config,
  production deployment, known limitations
- BENCHMARKS.md: measurements (stress 1×pub×4×sub, E2E real camera,
  prod multi-consumer 24h, VRAM cost per resolution, cuframes vs N×NVDEC)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-18 21:11:37 +01:00

Benchmarks

All measurements were taken on the reference hardware (see below). The numbers are repeatable and can be reproduced with libcuframes/tests/test_stress.cu and with tools/cuframes-rtsp-source plus examples/sub_count.

Reference hardware

Component	Value
GPU	NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB VRAM, 2× NVDEC, 3× NVENC)
CPU	AMD Ryzen 9 7950X (16 cores, 32 threads)
RAM	64 GB DDR5-6000
OS	Ubuntu 24.04 (kernel 6.17, glibc 2.39)
CUDA	Driver 555+, Toolkit 12.x / 13.x
PCIe	Gen5 ×16 (GPU connection)

Stress test — 1×publisher × 4×consumer × 2000 frames

Run: libcuframes/tests/test_stress.cu is a fork-based test with 1 publisher and 4 consumers (2 fast, 2 slow with a 5 ms sleep) on 1280×720 NV12 frames at a ~120 fps target.

Metric	Value
Frames per consumer	2000 / 2000
Gaps (lost seq)	0 for all 4 consumers
Torn frames (verified by the `verify_y` kernel)	0 for all 4 consumers
Wall time	18.8 s
Effective publisher rate	~106 fps (below the ~120 fps target because of the slow consumers)
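
The two derived rows of the table (gaps and effective rate) can be checked with a few lines of arithmetic. The sketch below is illustrative only: `count_gaps` and `effective_fps` are hypothetical helpers, not part of libcuframes, and it assumes the publisher numbers frames with consecutive, strictly increasing sequence values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a libcuframes API): counts how many sequence
// numbers are missing from the frames one consumer received, assuming the
// publisher assigns consecutive, strictly increasing sequence numbers.
inline std::uint64_t count_gaps(const std::vector<std::uint64_t>& seqs) {
    std::uint64_t missing = 0;
    for (std::size_t i = 1; i < seqs.size(); ++i) {
        assert(seqs[i] > seqs[i - 1]);         // input must be strictly increasing
        missing += seqs[i] - seqs[i - 1] - 1;  // adds 0 when frames are consecutive
    }
    return missing;
}

// Effective publisher rate: frames published divided by wall-clock time.
inline double effective_fps(std::uint64_t frames, double wall_seconds) {
    assert(wall_seconds > 0.0);
    return static_cast<double>(frames) / wall_seconds;
}
```

With the values from the table, `effective_fps(2000, 18.8)` comes out at roughly 106.4 fps, and a consumer that saw every frame yields `count_gaps(...) == 0`.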

E2E real camera — 1 publisher + 1 consumer

Camera: Dahua HFW3441, main stream 1920×1080 HEVC 25 fps, sub-stream 640×480 25 fps. Publisher: cuframes-rtsp-source (host build). Consumer: examples/sub_count or test_cuframes_source (cctv-processor's CuframesSource).

Metric	NV12 1920×1080	NV12 640×480
Frame size (packed)	3,110,400 bytes (~3 MB)	460,800 bytes
Effective bandwidth	75 MB/s	11 MB/s
Publisher decode rate	25.03 fps (matches camera)	25.00 fps
Consumer receive rate	25.03 fps	25.34 fps
100-frame test	0 drops, 0 gaps	0 drops, 0 gaps
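
The effective bandwidth row is the packed frame size multiplied by the frame rate. A minimal sketch of that calculation follows; the function and constant names are assumptions made for illustration, not identifiers from the repository.

```cpp
#include <cassert>
#include <cstdint>

constexpr double kMB  = 1e6;              // decimal megabyte
constexpr double kMiB = 1024.0 * 1024.0;  // binary mebibyte

// Bytes per second written for one decoded stream:
// packed frame size multiplied by the frame rate.
inline double stream_bytes_per_second(std::uint64_t frame_bytes, double fps) {
    assert(fps >= 0.0);
    return static_cast<double>(frame_bytes) * fps;
}
```

At a nominal 25 fps, the 3,110,400-byte main-stream frame gives 77.76 MB/s (about 74.2 MiB/s), and the 460,800-byte sub-stream frame gives 11.52 MB/s, in line with the rounded figures in the table.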

Production: 1× publisher → N× consumers (Frigate + cctv-backend)

Real production setup (24+ hours of uptime):

Publisher: cuframes-pub-parking — Dahua 192.168.88.98 sub-stream 640×480 HEVC 25 fps
Consumer 1: Frigate 0.17.1 via the FFmpeg cuframes:// demuxer (detect path; ONNX object detection)
Consumer 2: cctv-backend via the C++ CuframesSource (motion detect + grid composer + RTSP encode → TV)

Metric	Value
Total NVDEC operations	1 (publisher only)
Without cuframes it would be	2 (Frigate detect + cctv-backend detect)
GPU encoder	1× (cctv-backend H.264 encode for RTSP output)
Publisher VRAM ring	6 buffers × 460 KB ≈ 2.8 MB (sub-stream)
Frigate detect drops	0 over 24h
cctv-backend frame loss	0 over 24h

VRAM cost — NV12 ring buffer

Ring size in bytes = frame_size × ring_size, where ring_size is the number of buffers. NV12 frame size = width × height × 1.5 bytes.

Resolution	Frame size	Ring of 6 buffers
640×480	0.46 MB	2.8 MB
1280×720	1.38 MB	8.3 MB
1920×1080 (FHD)	3.11 MB	18.7 MB
2560×1440	5.53 MB	33.2 MB
2688×1520 (Dahua 4MP)	6.13 MB	36.8 MB
3840×2160 (4K)	12.44 MB	74.6 MB

Sizes are in decimal megabytes (1 MB = 10^6 bytes). For a 16-camera setup on the RTX 5090 (32 GB VRAM), with every camera at FHD and ring=6, the total is 16 × 18.7 MB ≈ 299 MB, which is under 1% of the available VRAM.
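
The sizing rule above can be written as a short helper. This is a sketch under one stated assumption: frames are packed, with no row-pitch padding, so an allocator that pads rows would use somewhat more memory. The function names are illustrative and do not come from libcuframes.

```cpp
#include <cassert>
#include <cstdint>

// NV12 stores a full-resolution 8-bit Y plane followed by an interleaved UV
// plane subsampled by 2 in both directions: width * height * 3 / 2 bytes
// when packed (no row-pitch padding, which is an assumption of this sketch).
inline std::uint64_t nv12_frame_bytes(std::uint64_t width, std::uint64_t height) {
    assert(width % 2 == 0 && height % 2 == 0);  // 4:2:0 chroma needs even dimensions
    return width * height * 3 / 2;
}

// VRAM held by one ring: frame size multiplied by the number of buffers.
inline std::uint64_t ring_bytes(std::uint64_t width, std::uint64_t height,
                                std::uint64_t ring_size) {
    return nv12_frame_bytes(width, height) * ring_size;
}
```

For 1920×1080 this gives 3,110,400 bytes per frame, 18,662,400 bytes for a six-buffer ring, and 298,598,400 bytes for sixteen such rings, which is under 1% of 32 GB.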

Comparison: cuframes vs traditional N×NVDEC

Scenario: 16 cameras × 25 fps × 3 consumers (Frigate, cctv-processor, AI pipeline).

Approach	NVDEC ops/sec	VRAM bandwidth (decoded path)
Without cuframes	16 × 25 × 3 = 1200	≥ 1200 × 6 MB = 7.2 GB/s
With cuframes (v0.1)	16 × 25 × 1 = 400	≥ 400 × 6 MB = 2.4 GB/s
Savings	3× fewer NVDEC ops	3× less memory bandwidth

NVDEC throughput limit on the RTX 5090: ~50 concurrent FHD25 streams. Without cuframes, 3 consumers × 16 cameras = 48 decode streams, about 96% of capacity, which is effectively saturation. With cuframes, 16 decode streams use about 32%, leaving headroom for scaling.
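
The decode-load comparison reduces to a small calculation, sketched below. The struct and function are hypothetical, and the capacity of 50 concurrent FHD25 streams is the estimate quoted above, passed in as a parameter rather than a measured constant.

```cpp
#include <cassert>

struct DecodeLoad {
    int streams;         // concurrent NVDEC decode streams
    int frames_per_sec;  // decode operations per second
    double utilization;  // fraction of the assumed NVDEC capacity
};

// decodes_per_camera is 1 when one publisher decodes and shares frames,
// or the number of consumers when each consumer decodes on its own.
inline DecodeLoad decode_load(int cameras, int fps, int decodes_per_camera,
                              int capacity_streams) {
    assert(capacity_streams > 0);
    const int streams = cameras * decodes_per_camera;
    return {streams, streams * fps,
            static_cast<double>(streams) / capacity_streams};
}
```

`decode_load(16, 25, 3, 50)` yields 48 streams, 1200 decodes per second, and 96% utilization; `decode_load(16, 25, 1, 50)` yields 16 streams, 400 decodes per second, and 32%.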

Latency

Hop	Latency
RTSP → publisher demuxer	sub-frame (<40 ms FHD25)
NVDEC decode	~3-5 ms per frame
publish_external → consumer receive	<0.5 ms (cudaEventRecord → cudaStreamWaitEvent)
consumer cudaMemcpy NV12 → host (FFmpeg demuxer v1)	~2-3 ms FHD
End-to-end RTSP → consumer frame ready	~50-100 ms typical

The zero-copy path (via AVHWFramesContext, planned for v0.2) will remove the copy to host memory, ideally bringing end-to-end latency under 10 ms.

Reproducibility

All benchmarks are reproducible from the repo:

# Stress test
cd build && cmake -DBUILD_TESTING=ON .. && cmake --build . && ctest -R stress -V

# E2E single consumer
./tools/cuframes-rtsp-source --rtsp rtsp://... --key cam1 --ring 6 --verbose &
./examples/sub_count --key cam1 --max-frames 100 --verbose

For production deployment measurements, see the integration guides:

docs/integration.md — cctv-processor C++ pipeline
filter/README.md — FFmpeg demuxer (Frigate setup)
