gx a3ba3a95b2 docs: ROADMAP + CHANGELOG v0.1.0 + BENCHMARKS
- ROADMAP.md: structured v0.1 / v0.2📋 (encoded packet sharing + FFmpeg
  upstream PR + scale-cuda alt) / v0.3 (Python bindings, Jetson, multi-GPU)
  / v1.0 (stable ABI)
- CHANGELOG.md: full v0.1.0 release notes — features, tested config,
  production deployment, known limitations
- BENCHMARKS.md: measurements (stress 1×pub×4×sub, E2E real camera,
  prod multi-consumer 24h, VRAM cost per resolution, cuframes vs N×NVDEC)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 21:11:37 +01:00

# Benchmarks
All measurements were taken on the reference hardware (see below). The numbers are repeatable
and can be reproduced with `libcuframes/tests/test_stress.cu` and
[`tools/cuframes-rtsp-source`](tools/cuframes-rtsp-source) + `examples/sub_count`.
## Reference hardware
| Component | Value |
|---|---|
| GPU | NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB VRAM, 3× NVDEC, 1× NVENC) |
| CPU | AMD Ryzen 9 7950X (16 cores, 32 threads) |
| RAM | 64 GB DDR5-6000 |
| OS | Ubuntu 24.04 (kernel 6.17, glibc 2.39) |
| CUDA | Driver 555+, Toolkit 12.x / 13.x |
| PCIe | Gen5 ×16 (GPU connection) |
## Stress test — 1×publisher × 4×consumer × 2000 frames
Run: `libcuframes/tests/test_stress.cu` (fork-based): 1 publisher + 4 consumers
(2 fast + 2 slow with a 5 ms sleep) on 1280×720 NV12 frames at a ~120 fps target.
| Metric | Value |
|---|---|
| Frames per consumer | 2000 / 2000 |
| Gaps (lost seq) | **0 on all 4 consumers** |
| Torn frames (verified by the `verify_y` kernel) | **0 on all 4 consumers** |
| Wall time | 18.8 s |
| Effective publisher rate | ~106 fps (below the ~120 fps target because of the slow consumers) |
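
The "Gaps" and "Effective publisher rate" rows can be sanity-checked with a few lines of Python. This is only an illustrative sketch: `count_gaps` is a hypothetical helper, not part of `test_stress.cu`, and it assumes each consumer logs the strictly increasing sequence numbers it received.

```python
def count_gaps(seqs):
    """Count sequence numbers missing from a strictly increasing stream.

    Hypothetical helper for illustration; not taken from test_stress.cu.
    """
    lost = 0
    prev = None
    for s in seqs:
        if prev is not None and s > prev + 1:
            lost += s - prev - 1  # every skipped number is one lost frame
        prev = s
    return lost


# A consumer that saw all 2000 frames reports no gaps.
print(count_gaps(range(2000)))        # 0
# A consumer that missed seq 3 and 4 reports two lost frames.
print(count_gaps([0, 1, 2, 5, 6]))    # 2

# Effective publisher rate = frames / wall time, using the table values.
print(round(2000 / 18.8, 1))          # 106.4
```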
## E2E real camera — 1 publisher + 1 consumer
Camera: Dahua HFW3441 main-stream 1920×1080 HEVC 25 fps, sub-stream 640×480 25 fps.
Publisher: `cuframes-rtsp-source` (host build); consumer: `examples/sub_count` or
`test_cuframes_source` (cctv-processor's `CuframesSource`).
| Metric | NV12 1920×1080 | NV12 640×480 |
|---|---|---|
| Frame size (packed) | 3,110,400 bytes (~3 MB) | 460,800 bytes |
| Effective bandwidth | 75 MB/s | 11 MB/s |
| Publisher decode rate | 25.03 fps (matches camera) | 25.00 fps |
| Consumer receive rate | 25.03 fps | 25.34 fps |
| 100-frame test | 0 drops, 0 gaps | 0 drops, 0 gaps |
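
The frame-size and bandwidth rows follow from the NV12 layout (a full-resolution Y plane plus a half-size interleaved UV plane). The sketch below recomputes them; the function names are hypothetical, the frames are assumed tightly packed (no pitch padding), and reading the table's "MB/s" as MiB/s is an assumption that happens to match the rounded figures.

```python
def nv12_frame_bytes(width, height):
    # NV12 = Y plane (width*height) + interleaved UV plane (width*height/2).
    # Assumes even dimensions and no row padding.
    return width * height * 3 // 2


def bandwidth_mib_per_s(width, height, fps):
    # Bytes moved per second for one decoded stream, expressed in MiB/s.
    return nv12_frame_bytes(width, height) * fps / 2**20


for w, h in [(1920, 1080), (640, 480)]:
    print(w, h, nv12_frame_bytes(w, h), round(bandwidth_mib_per_s(w, h, 25), 1))
# 1920 1080 3110400 74.2
# 640 480 460800 11.0
```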
## Production: 1× publisher → N× consumers (Frigate + cctv-backend)
Real production setup (24+ hours of uptime):
- Publisher: `cuframes-pub-parking` — Dahua 192.168.88.98 sub-stream 640×480 HEVC 25 fps
- Consumer 1: **Frigate 0.17.1** via the FFmpeg `cuframes://` demuxer (detect path; ONNX object detection)
- Consumer 2: **cctv-backend** via the C++ `CuframesSource` (motion detect + grid composer + RTSP encode → TV)
| Metric | Value |
|---|---|
| Total NVDEC operations | **1** (publisher only) |
| Without cuframes it would be | **2** (Frigate detect + cctv-backend detect) |
| GPU encoder | 1× (cctv-backend H.264 encode for RTSP output) |
| Publisher VRAM ring | 6 buffers × 460 KB ≈ **2.8 MB** (sub-stream) |
| Frigate detect drops | 0 over 24h |
| cctv-backend frame loss | 0 over 24h |
## VRAM cost — NV12 ring buffer
Ring size in bytes = `frame_size × ring_size`. NV12 frame size = `width × height × 1.5`.
| Resolution | Frame size | Ring 6 buffers |
|---|---|---|
| 640×480 | 460 KB | 2.8 MB |
| 1280×720 | 1.38 MB | 8.3 MB |
| 1920×1080 (FHD) | 3 MB | 18 MB |
| 2560×1440 | 5.4 MB | 33 MB |
| 2688×1520 (Dahua 4MP) | 6 MB | 36 MB |
| 3840×2160 (4K) | 12 MB | 72 MB |
For a 16-camera setup on an RTX 5090 (32 GB VRAM) with all cameras at FHD and ring=6,
the total is 16 × 18 MB ≈ 288 MB. **<1% of available VRAM.**
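
The 16-camera estimate can be recomputed from exact byte counts. The snippet is a rough sketch with hypothetical names; it assumes tightly packed NV12 frames and treats 32 GB as 32 × 10⁹ bytes. The exact total comes out slightly above 288 MB because the table rounds an FHD frame down to 3 MB.

```python
def ring_bytes(width, height, ring_size=6):
    # frame_size * ring_size, with NV12 frame_size = width * height * 1.5
    return width * height * 3 // 2 * ring_size


cameras = 16
total = cameras * ring_bytes(1920, 1080)   # all cameras FHD, ring=6
vram = 32 * 10**9                          # assumption: decimal gigabytes

print(ring_bytes(640, 480))                # 2764800 (about 2.8 MB)
print(total)                               # 298598400 (about 299 MB)
print(round(100 * total / vram, 2))        # 0.93 (percent of VRAM)
```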
## Comparison: cuframes vs traditional N×NVDEC
Scenario: 16 cameras × 25 fps × 3 consumers (Frigate, cctv-processor, AI pipeline).
| Approach | NVDEC ops/sec | VRAM bandwidth (decoded path) |
|---|---|---|
| Without cuframes | 16 × 25 × 3 = **1200** | ≥ 1200 × 6 MB = 7.2 GB/s |
| With cuframes (v0.1) | 16 × 25 × 1 = **400** | ≥ 16 × 25 × 6 MB = 2.4 GB/s |
| **Savings** | **3× fewer NVDEC ops** | **3× less memory bandwidth** |
The NVDEC throughput limit on the RTX 5090 is ~50 concurrent FHD25 streams. Without cuframes,
3 consumers × 16 cameras = 48 decode sessions, ~96% of capacity, which means saturation.
With cuframes, 16 sessions use ~32%, leaving headroom to scale.
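
The arithmetic behind the comparison can be written down as a small model. This is a sketch under the document's own assumptions: `decode_load` is a hypothetical helper, the capacity of ~50 concurrent FHD25 streams is taken from the text above, and every stream is assumed to run at the same frame rate.

```python
def decode_load(cameras, fps, decoders_per_camera, capacity_streams=50):
    """Return (NVDEC ops per second, fraction of decode capacity used).

    decoders_per_camera is 1 with a shared publisher, or the number of
    consumers when each consumer decodes the stream on its own.
    """
    sessions = cameras * decoders_per_camera
    return sessions * fps, sessions / capacity_streams


without = decode_load(16, 25, decoders_per_camera=3)
with_cuframes = decode_load(16, 25, decoders_per_camera=1)

print(without)                           # (1200, 0.96)
print(with_cuframes)                     # (400, 0.32)
print(without[0] // with_cuframes[0])    # 3
```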
## Latency
| Hop | Latency |
|---|---|
| RTSP → publisher demuxer | sub-frame (<40 ms FHD25) |
| NVDEC decode | ~3-5 ms per frame |
| publish_external → consumer receive | **<0.5 ms** (cudaEventRecord → cudaStreamWaitEvent) |
| consumer cudaMemcpy NV12 → host (FFmpeg demuxer v1) | ~2-3 ms FHD |
| **End-to-end RTSP → consumer frame ready** | ~50-100 ms typical |
The zero-copy path (via `AVHWFramesContext`, planned for v0.2) will remove the CPU copy;
ideally `<10 ms` end-to-end.
## Reproducibility
All benchmarks can be reproduced from the repo:
```bash
# Stress test
cd build && cmake -DBUILD_TESTING=ON .. && cmake --build . && ctest -R stress -V
# E2E single consumer
./tools/cuframes-rtsp-source --rtsp rtsp://... --key cam1 --ring 6 --verbose &
./examples/sub_count --key cam1 --max-frames 100 --verbose
```
For production deployment measurements, see the integration guides:
- [docs/integration.md](docs/integration.md) — cctv-processor C++ pipeline
- [filter/README.md](filter/README.md) — FFmpeg demuxer (Frigate setup)