gx 52fb2ad722 benchmarks: actual measured VRAM + network bandwidth (tcpdump-based)
VRAM breakdown (nvidia-smi pmon):
- 4 publishers = 4.4 GB (FHD + 2688x1520 ring buffers + NVDEC)
- cctv-backend = 1.0 GB
- frigate embeddings_manager = 1.6 GB
- frigate detector:onnx = 0.6 GB
- Total cuframes-stack = ~7.7 GB

Network (10-sec tcpdump capture from the camera subnet to R9):
- Measured: 31.5 Mbps (everything, including go2rtc on-demand and ONVIF)
- cuframes core: ~16 Mbps (4 publishers × main HEVC)
- ONVIF/RTSP keepalives: ~1-2 Mbps
- Without cuframes, the same 4 cam × 3 consumer setup would be ~45-50 Mbps

Source: production deploy 2026-05-19 measurement.
2026-05-19 19:22:53 +01:00

# Benchmarks
All measurements were taken on the reference hardware (see below). The numbers are
repeatable and can be reproduced with `libcuframes/tests/test_stress.cu` and
[`tools/cuframes-rtsp-source`](tools/cuframes-rtsp-source) + `examples/sub_count`.
## Reference hardware
| Component | Value |
|---|---|
| GPU | NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB VRAM, 3× NVDEC, 1× NVENC) |
| CPU | AMD Ryzen 9 7950X (16 cores, 32 threads) |
| RAM | 64 GB DDR5-6000 |
| OS | Ubuntu 24.04 (kernel 6.17, glibc 2.39) |
| CUDA | Driver 555+, Toolkit 12.x / 13.x |
| PCIe | Gen5 ×16 (GPU connection) |
## Stress test — 1×publisher × 4×consumer × 2000 frames
Run: `libcuframes/tests/test_stress.cu`, fork-based, 1 publisher + 4 consumers
(2 fast + 2 slow with a 5 ms sleep) on 1280×720 NV12 frames at a ~120 fps target.
| Metric | Value |
|---|---|
| Frames per consumer | 2000 / 2000 |
| Gaps (lost seq) | **0 for all 4 consumers** |
| Torn frames (verified by the `verify_y` kernel) | **0 for all 4 consumers** |
| Wall time | 18.8 s |
| Effective publisher rate | ~106 fps (below the target because of the slow consumers) |
## E2E real camera — 1 publisher + 1 consumer
Camera: Dahua HFW3441 main-stream 1920×1080 HEVC 25 fps, sub-stream 640×480 25 fps.
Publisher: `cuframes-rtsp-source` (host build), consumer: `examples/sub_count` or
`test_cuframes_source` (cctv-processor's `CuframesSource`).
| Metric | NV12 1920×1080 | NV12 640×480 |
|---|---|---|
| Frame size (packed) | 3,110,400 bytes (~3 MB) | 460,800 bytes |
| Effective bandwidth | 75 MB/s | 11 MB/s |
| Publisher decode rate | 25.03 fps (matches camera) | 25.00 fps |
| Consumer receive rate | 25.03 fps | 25.34 fps |
| 100-frame test | 0 drops, 0 gaps | 0 drops, 0 gaps |
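
The frame-size and bandwidth rows follow from the NV12 layout (1.5 bytes per pixel) multiplied by the frame rate. A minimal sketch of that arithmetic; the function names are illustrative and not part of cuframes:

```python
def nv12_frame_bytes(width: int, height: int) -> int:
    """Packed NV12 size: full-res Y plane plus a half-size interleaved UV plane."""
    return width * height * 3 // 2


def stream_bytes_per_sec(width: int, height: int, fps: float) -> float:
    """Decoded-frame bandwidth for one stream at the given frame rate."""
    return nv12_frame_bytes(width, height) * fps


for width, height in [(1920, 1080), (640, 480)]:
    size = nv12_frame_bytes(width, height)
    rate = stream_bytes_per_sec(width, height, 25)
    print(f"{width}x{height}: {size:,} bytes/frame, "
          f"{rate / 1e6:.1f} MB/s ({rate / 2**20:.1f} MiB/s)")
```

At 25 fps the FHD case works out to about 77.8 MB/s in decimal units or 74.2 MiB/s, which brackets the rounded 75 MB/s in the table.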
## Production: 1× publisher → N× consumers (Frigate + cctv-backend)
Real production setup (24+ hours of uptime):
- Publisher: `cuframes-pub-parking`, Dahua 192.168.88.98 sub-stream 640×480 HEVC 25 fps
- Consumer 1: **Frigate 0.17.1** via the FFmpeg `cuframes://` demuxer (detect path; ONNX object detection)
- Consumer 2: **cctv-backend** via the C++ `CuframesSource` (motion detect + grid composer + RTSP encode → TV)
| Metric | Value |
|---|---|
| Total NVDEC operations | **1** (publisher only) |
| Without cuframes it would be | **2** (Frigate detect + cctv-backend detect) |
| GPU encoder | 1× (cctv-backend H.264 encode for RTSP output) |
| Publisher VRAM ring | 6 buffers × 460 KB ≈ **2.8 MB** (sub-stream) |
| Frigate detect drops | 0 over 24h |
| cctv-backend frame loss | 0 over 24h |
## VRAM cost — NV12 ring buffer
Ring size = `frame_size × ring_size`. NV12 frame size = `width × height × 1.5`.
| Resolution | Frame size | Ring 6 buffers |
|---|---|---|
| 640×480 | 460 KB | 2.8 MB |
| 1280×720 | 1.38 MB | 8.3 MB |
| 1920×1080 (FHD) | 3 MB | 18 MB |
| 2560×1440 | 5.4 MB | 33 MB |
| 2688×1520 (Dahua 4MP) | 6 MB | 36 MB |
| 3840×2160 (4K) | 12 MB | 72 MB |
For a 16-camera setup on an RTX 5090 (32 GB VRAM) with all cameras at FHD and ring=6,
the total is ~288 MB. **<1% of the available VRAM.**
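
The ring-buffer cost can be re-derived for any resolution and camera count. The sketch below is illustrative only: `nv12_ring_bytes` is a hypothetical helper, not a cuframes API, and the 32 GB figure is treated as decimal gigabytes.

```python
def nv12_ring_bytes(width: int, height: int, ring_size: int = 6) -> int:
    """VRAM held by one publisher's ring of packed NV12 frames."""
    frame_bytes = width * height * 3 // 2  # 1.5 bytes per pixel
    return frame_bytes * ring_size


CAMERAS = 16
VRAM_BYTES = 32 * 10**9  # assumption: 32 GB counted in decimal units

total = CAMERAS * nv12_ring_bytes(1920, 1080)
share = total / VRAM_BYTES
print(f"{CAMERAS} FHD rings: {total / 1e6:.1f} MB ({share:.2%} of VRAM)")
```

With unrounded frame sizes the total comes out slightly above the ~288 MB quoted above, because the table rounds an FHD frame down to 3 MB; the share still stays under 1%.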
## Comparison: cuframes vs traditional N×NVDEC
Scenario: 16 cameras × 25 fps × 3 consumers (Frigate, cctv-processor, AI-pipeline).
| Approach | NVDEC ops/sec | VRAM bandwidth (decoded path) |
|---|---|---|
| Without cuframes | 16 × 25 × 3 = **1200** | ≥ 1200 × 6 MB = 7.2 GB/s |
| With cuframes (v0.1) | 16 × 25 × 1 = **400** | ≥ 16 × 25 × 6 MB = 2.4 GB/s |
| **Savings** | **3× fewer NVDEC ops** | **3× less memory bw** |
NVDEC throughput limit on the RTX 5090: ~50 concurrent FHD25 streams. Without cuframes,
3 consumers × 16 cameras = 48 decode sessions, which takes ~96% of capacity (saturation).
With cuframes, 16 sessions take ~32%, leaving headroom for scaling.
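
The decode-load comparison above reduces to two small formulas. A sketch, assuming the ~50-stream NVDEC capacity quoted in the text; the helper names are illustrative:

```python
def decode_ops_per_sec(cameras: int, fps: int, decoders_per_camera: int) -> int:
    """Frames decoded per second across all NVDEC sessions."""
    return cameras * fps * decoders_per_camera


def nvdec_utilisation(sessions: int, capacity_streams: int = 50) -> float:
    """Fraction of the estimated concurrent-stream capacity in use."""
    return sessions / capacity_streams


CAMERAS, FPS, CONSUMERS = 16, 25, 3

without = decode_ops_per_sec(CAMERAS, FPS, CONSUMERS)  # every consumer decodes
with_cuframes = decode_ops_per_sec(CAMERAS, FPS, 1)    # only the publisher decodes

print(f"ops/sec: {without} -> {with_cuframes}")
print(f"NVDEC load: {nvdec_utilisation(CAMERAS * CONSUMERS):.0%} "
      f"-> {nvdec_utilisation(CAMERAS):.0%}")
```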
## Latency
| Hop | Latency |
|---|---|
| RTSP → publisher demuxer | sub-frame (<40 ms FHD25) |
| NVDEC decode | ~3-5 ms per frame |
| publish_external → consumer receive | **<0.5 ms** (cudaEventRecord → cudaStreamWaitEvent) |
| consumer cudaMemcpy NV12 → host (FFmpeg demuxer v1) | ~2-3 ms FHD |
| **End-to-end RTSP → consumer frame ready** | ~50-100 ms typical |
The zero-copy path (via `AVHWFramesContext`, planned for v0.2) will remove the CPU copy:
ideally `<10 ms` end-to-end.
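
To see how the per-hop figures relate to the end-to-end number, the hops can be summed as (low, high) ranges. This is a rough sketch: treating every "<" entry as having a lower bound of 0 ms is an assumption, not something the table states.

```python
# (low_ms, high_ms) per hop, taken from the latency table.
HOPS_MS = {
    "RTSP -> publisher demuxer": (0.0, 40.0),
    "NVDEC decode": (3.0, 5.0),
    "publish_external -> consumer receive": (0.0, 0.5),
    "consumer cudaMemcpy NV12 -> host": (2.0, 3.0),
}


def latency_budget(hops: dict) -> tuple:
    """Sum the low and high bounds over all hops."""
    low = sum(lo for lo, _ in hops.values())
    high = sum(hi for _, hi in hops.values())
    return low, high


low, high = latency_budget(HOPS_MS)
print(f"itemized hops: {low:.1f}-{high:.1f} ms")
```

The itemized hops add up to just under 50 ms at worst, so the ~50-100 ms typical figure presumably includes time that the table does not break out.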
## Reproducibility
All benchmarks can be reproduced from the repo:
```bash
# Stress test
cd build && cmake -DBUILD_TESTING=ON .. && cmake --build . && ctest -R stress -V
# E2E single consumer
./tools/cuframes-rtsp-source --rtsp rtsp://... --key cam1 --ring 6 --verbose &
./examples/sub_count --key cam1 --max-frames 100 --verbose
```
For production deployment measurements, see the integration guides:
- [docs/integration.md](docs/integration.md) — cctv-processor C++ pipeline
- [filter/README.md](filter/README.md) — FFmpeg demuxer (Frigate setup)
---
## Real-world production deployment (2026-05-19, v0.2.0)
**Setup**: 4 Dahua IP cameras (HEVC main 1920×1080 / 2688×1520, 25 fps) → 3
simultaneous consumers on a single RTX 5090 host:
- **Frigate** detect (ONNX D-FINE-S, 640×480) + record (full-res H.265 mp4)
- **cctv-backend** custom C++ mosaic processor (composes 4×grid → RTSP output for TV)
### Before → after (measured in production, identical workload)
| Metric | Without cuframes | With cuframes v0.2 dual-input | Reduction |
|---|---:|---:|---:|
| **RTSP connections to cameras** | 12 (4 cam × 3 consumer) | **4** (publishers only) | **67%** |
| **NVDEC sessions** | ~8 (decode in each consumer) | **4** (publishers only) | **50%** |
| **Camera-side bandwidth** | ~34 Mbps (main+main+sub per cam) | **~16 Mbps** (main per cam) | **54%** |
| **PCIe D2H copies (consumer side)** | ~346 MB/s (decoded frames → host) | **~0** (zero-copy CUDA IPC) | **100%** |
| **Frigate ffmpeg processes with direct RTSP** | 8 (detect+record × 4) | **0** (all via cuframes) | **100%** |
### Live nvidia-smi metrics on the running system
```
GPU SM: 4-5% (compute: detector + cuframes consumers)
GPU NVDEC: 2-4% (expected 15-25% without cuframes)
GPU NVENC: 0-1%
```
### VRAM breakdown (measured)
| Component | VRAM |
|---|---:|
| 4× cuframes publishers (3× FHD ring + 1× 2688×1520 for LPR) | **4.4 GB** |
| cctv-backend (composer + grid output) | 1.0 GB |
| frigate.embeddings_manager (face + LPR ONNX models) | 1.6 GB |
| frigate.detector:onnx (D-FINE-S COCO) | 0.6 GB |
| **Total cuframes-stack VRAM** | **~7.7 GB** |
Of this, only the **4.4 GB** in the publishers (ring buffers + NVDEC decode buffers)
is attributable to cuframes itself. The consumers (Frigate, cctv-backend) hold their
own CUDA contexts independently.
### Network bandwidth (real tcpdump, 10-sec sample)
**31.5 Mbps** from the camera subnet (4 cameras → R9), measured with
`tcpdump -w cam-traffic.pcap` over 10 seconds.
Approximate breakdown:
- 4 publishers × main HEVC RTP/UDP: **~16 Mbps** (cuframes core)
- go2rtc on-demand streams (Frigate UI live preview, if open): **0-10 Mbps**
- ONVIF discovery, RTSP keepalives, NTP-from-cameras: **~1-2 Mbps**
Without cuframes, the same setup (cctv-backend + Frigate detect + Frigate record × 4
cameras) would produce **~45-50 Mbps**, mainly because the record path would pull a
separate main stream from each camera.
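
One way to turn a capture such as `cam-traffic.pcap` into an average rate is to sum the on-the-wire packet lengths and divide by the capture duration. The sketch below parses the classic pcap format with the standard library only; `pcap_mbps` is a hypothetical helper, not a tool from this repo, and it does not handle pcapng files.

```python
import struct

# Classic pcap magic numbers as they appear on disk.
_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", 1e6),  # little-endian, microsecond timestamps
    b"\xa1\xb2\xc3\xd4": (">", 1e6),  # big-endian, microsecond timestamps
    b"\x4d\x3c\xb2\xa1": ("<", 1e9),  # little-endian, nanosecond timestamps
    b"\xa1\xb2\x3c\x4d": (">", 1e9),  # big-endian, nanosecond timestamps
}


def pcap_mbps(stream) -> float:
    """Average rate in Mbit/s between the first and last packet of a pcap stream."""
    header = stream.read(24)  # global header
    if header[:4] not in _MAGICS:
        raise ValueError("not a classic pcap file")
    endian, ts_div = _MAGICS[header[:4]]

    total_bytes, first_ts, last_ts = 0, None, None
    while True:
        record = stream.read(16)  # per-packet record header
        if len(record) < 16:
            break
        ts_sec, ts_frac, incl_len, orig_len = struct.unpack(endian + "IIII", record)
        stream.seek(incl_len, 1)  # skip the captured payload
        ts = ts_sec + ts_frac / ts_div
        if first_ts is None:
            first_ts = ts
        last_ts = ts
        total_bytes += orig_len  # original length, even if the snaplen truncated it

    if first_ts is None or last_ts == first_ts:
        raise ValueError("need at least two packets with distinct timestamps")
    return total_bytes * 8 / (last_ts - first_ts) / 1e6
```

Usage would look like `pcap_mbps(open("cam-traffic.pcap", "rb"))`. For scale, 31.5 Mbps sustained over 10 seconds corresponds to roughly 39.4 MB of captured traffic.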
### Camera-side benefits
Dahua/Hikvision cameras are usually capped at 4-5 simultaneous RTSP streams.
Before cuframes, the setup (4 cam × 3 RTSP) put each camera at **60-75% of the capacity**
of its RTSP server. Afterwards it is **20-25%**, leaving headroom for 2-3 additional
consumers without replacing hardware.
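
The utilisation percentages are the number of RTSP sessions per camera divided by the camera's session cap. A small sketch, assuming the 4-5 stream cap stated above; the helper name is illustrative:

```python
def rtsp_utilisation(sessions_per_camera: int, session_cap: int) -> float:
    """Share of a camera's RTSP session limit that is in use."""
    return sessions_per_camera / session_cap


for cap in (4, 5):
    before = rtsp_utilisation(3, cap)  # one session per consumer
    after = rtsp_utilisation(1, cap)   # only the cuframes publisher connects
    print(f"cap={cap}: {before:.0%} -> {after:.0%}")
```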
### What is **preserved** (important)
- **Recording quality**: the record path via `cuframes_packets://` is a **passthrough**
(`-c:v copy`), the bit-exact original encoded stream from the camera. Frigate writes mp4
at the full resolution of the original, without re-encoding.
- **Latency**: <2 ms publisher → consumer (cuframes IPC) vs ~50-80 ms RTSP setup
latency for each new consumer.
- **Backward compatibility**: v0.2 publishers accept v1 subscribers
(frames-only), which allows a rolling upgrade.
### Hardware-agnostic projection (for other setups)
| If you have | Expected reduction |
|---|---|
| 16 cameras × 2 consumers | 32 → 16 NVDEC (50%), 32 → 16 RTSP (50%) |
| 8 cameras × 3 consumers | 24 → 8 NVDEC (67%), 24 → 8 RTSP (67%) |
| 4 cameras × 4 consumers (multi-AI pipeline) | 16 → 4 NVDEC (75%), 16 → 4 RTSP (75%) |
The number of sessions saved, cameras × (N − 1), grows **linearly** with N (consumers
per camera); the relative reduction is 1 − 1/N. v0.1 (frames only) saves NVDEC
sessions; v0.2 (frames + packets) **additionally** saves RTSP connections for
record/mux consumers.
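
The projection rows can be regenerated for any camera and consumer count. A minimal sketch, assuming every consumer would otherwise open its own decode session and RTSP connection; `project` is an illustrative name:

```python
def project(cameras: int, consumers: int) -> tuple:
    """Return (sessions_before, sessions_after, relative_reduction)."""
    before = cameras * consumers  # one session per consumer per camera
    after = cameras               # one publisher per camera
    return before, after, 1 - after / before


for cameras, consumers in [(16, 2), (8, 3), (4, 4)]:
    before, after, reduction = project(cameras, consumers)
    print(f"{cameras} cameras x {consumers} consumers: "
          f"{before} -> {after} ({reduction:.0%})")
```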
### What is **NOT** saved (to be honest)
- **Disk space**: recordings remain full-resolution H.265 mp4. cuframes does not compress.
- **Detector inference latency**: the ONNX/TensorRT detector runs on decoded
frames regardless of the source. cuframes only changes where the decode happens.
- **Camera RTSP server CPU**: the camera still encodes the video itself. cuframes
reduces **consumer-side** load, not producer-side.