# Benchmarks

All measurements were taken on the reference hardware described below. The numbers
are repeatable and can be reproduced with `libcuframes/tests/test_stress.cu` and with
[`tools/cuframes-rtsp-source`](tools/cuframes-rtsp-source) plus `examples/sub_count`.

## Reference hardware

| Component | Value |
|---|---|
| GPU | NVIDIA RTX 5090 (Blackwell, sm_120, 32 GB VRAM, 2× NVDEC, 3× NVENC) |
| CPU | AMD Ryzen 9 7950X (16 cores, 32 threads) |
| RAM | 64 GB DDR5-6000 |
| OS | Ubuntu 24.04 (kernel 6.17, glibc 2.39) |
| CUDA | Driver 555+, Toolkit 12.x / 13.x |
| PCIe | Gen5 ×16 (GPU connection) |

## Stress test: 1 publisher × 4 consumers × 2000 frames

Run: `libcuframes/tests/test_stress.cu`, a fork-based test with 1 publisher and 4 consumers
(2 fast, 2 slow with a 5 ms sleep) on 1280×720 NV12 frames at a ~120 fps target.

| Metric | Value |
|---|---|
| Frames per consumer | 2000 / 2000 |
| Gaps (lost seq) | **0 for all 4 consumers** |
| Torn frames (verified by the `verify_y` kernel) | **0 for all 4 consumers** |
| Wall time | 18.8 s |
| Effective publisher rate | ~106 fps (below the 120 fps target because of the slow consumers) |

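The gap and rate metrics above can be illustrated with a small sketch. It assumes each frame carries a sequence number that the publisher increments by 1; the `count_gaps` helper is hypothetical and is not part of the cuframes API or of `test_stress.cu`.

```python
def count_gaps(seqs):
    """Return how many sequence numbers are missing from a received stream.

    Assumes the publisher increments the sequence number by 1 per frame.
    This is an illustrative assumption, not the cuframes wire format.
    """
    lost = 0
    prev = None
    for seq in seqs:
        if prev is not None and seq > prev + 1:
            lost += seq - prev - 1  # frames skipped between prev and seq
        prev = seq
    return lost


# A consumer that saw every frame of a 2000-frame run reports no gaps.
print(count_gaps(range(2000)))         # 0
# Missing frames 5 and 6 show up as two lost sequence numbers.
print(count_gaps([0, 1, 2, 3, 4, 7]))  # 2

# Effective publisher rate = frames published / wall time.
effective_fps = 2000 / 18.8
print(round(effective_fps))            # 106
```

A torn-frame check is a separate concern: it inspects pixel content (the `verify_y` kernel in the test), which this sketch does not attempt to model.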
## E2E real camera: 1 publisher + 1 consumer

Camera: Dahua HFW3441, main stream 1920×1080 HEVC 25 fps, sub-stream 640×480 25 fps.
Publisher: `cuframes-rtsp-source` (host build). Consumer: `examples/sub_count` or
`test_cuframes_source` (cctv-processor's `CuframesSource`).

| Metric | NV12 1920×1080 | NV12 640×480 |
|---|---|---|
| Frame size (packed) | 3,110,400 bytes (~3 MB) | 460,800 bytes |
| Effective bandwidth | ~75 MB/s | ~11 MB/s |
| Publisher decode rate | 25.03 fps (matches camera) | 25.00 fps |
| Consumer receive rate | 25.03 fps | 25.34 fps |
| 100-frame test | 0 drops, 0 gaps | 0 drops, 0 gaps |

## Production: 1× publisher → N× consumers (Frigate + cctv-backend)

Real production setup (24+ hours of uptime):
- Publisher: `cuframes-pub-parking`, Dahua 192.168.88.98 sub-stream 640×480 HEVC 25 fps
- Consumer 1: **Frigate 0.17.1** via the FFmpeg `cuframes://` demuxer (detect path; ONNX object detection)
- Consumer 2: **cctv-backend** via the C++ `CuframesSource` (motion detect + grid composer + RTSP encode → TV)

| Metric | Value |
|---|---|
| Total NVDEC decode sessions | **1** (publisher only) |
| NVDEC decode sessions without cuframes | **2** (Frigate detect + cctv-backend detect) |
| GPU encoder | 1× (cctv-backend H.264 encode for RTSP output) |
| Publisher VRAM ring | 6 buffers × 460 KB ≈ **2.8 MB** (sub-stream) |
| Frigate detect drops | 0 over 24h |
| cctv-backend frame loss | 0 over 24h |

## VRAM cost: NV12 ring buffer

Ring size = `frame_size × ring_size`. NV12 frame size = `width × height × 1.5`.
Values below are for packed frames, in decimal units (1 MB = 10^6 bytes).

| Resolution | Frame size | Ring of 6 buffers |
|---|---|---|
| 640×480 | 460.8 KB | 2.8 MB |
| 1280×720 | 1.38 MB | 8.3 MB |
| 1920×1080 (FHD) | 3.11 MB | 18.7 MB |
| 2560×1440 | 5.53 MB | 33.2 MB |
| 2688×1520 (Dahua 4MP) | 6.13 MB | 36.8 MB |
| 3840×2160 (4K) | 12.44 MB | 74.6 MB |

For a 16-camera setup on the RTX 5090 (32 GB VRAM) with every camera at FHD and ring=6,
the total is 16 × 18.7 MB ≈ 300 MB. **That is under 1% of available VRAM.**

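The sizing formula can be checked with a short calculation. This is a sketch that assumes packed NV12 with no pitch or alignment padding (real GPU allocations may be somewhat larger); the helper names `nv12_frame_bytes` and `ring_bytes` are made up for illustration.

```python
def nv12_frame_bytes(width, height):
    """Packed NV12: full-resolution Y plane plus a half-size interleaved UV plane."""
    return width * height * 3 // 2


def ring_bytes(width, height, ring_size=6):
    """VRAM held by one publisher ring of `ring_size` frames."""
    return nv12_frame_bytes(width, height) * ring_size


resolutions = [(640, 480), (1280, 720), (1920, 1080),
               (2560, 1440), (2688, 1520), (3840, 2160)]
for w, h in resolutions:
    frame = nv12_frame_bytes(w, h)
    ring = ring_bytes(w, h)
    print(f"{w}x{h}: frame {frame / 1e6:.2f} MB, ring {ring / 1e6:.1f} MB")

# 16 FHD cameras with one 6-buffer ring each, as a share of 32 GB of VRAM.
total = 16 * ring_bytes(1920, 1080)
print(f"total {total / 1e6:.0f} MB = {100 * total / 32e9:.2f}% of 32 GB")
```

The FHD and 640×480 frame sizes produced here match the packed byte counts reported in the E2E table (3,110,400 and 460,800 bytes).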
## Comparison: cuframes vs traditional N×NVDEC

Scenario: 16 cameras × 25 fps × 3 consumers (Frigate, cctv-processor, AI-pipeline).

| Approach | NVDEC ops/sec | VRAM bandwidth (decoded path) |
|---|---|---|
| Without cuframes | 16 × 25 × 3 = **1200** | ≥ 1200 × 6 MB = 7.2 GB/s |
| With cuframes (v0.1) | 16 × 25 × 1 = **400** | ≥ 16 × 25 × 6 MB = 2.4 GB/s |
| **Savings** | **3× fewer NVDEC ops** | **3× less memory bandwidth** |

The NVDEC throughput limit on the RTX 5090 is ~50 concurrent FHD25 streams. Without cuframes,
3 consumers × 16 cameras = 48 streams, which takes ~96% of capacity and saturates the decoder.
With cuframes, the 16 streams take ~32%, leaving headroom for scaling.

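The arithmetic behind this comparison can be written out as a small model. It is only a sketch: the capacity of 50 concurrent FHD25 streams is the approximate figure quoted in this section, and the function names are hypothetical.

```python
CAMERAS = 16
FPS = 25
CONSUMERS = 3
NVDEC_STREAM_CAPACITY = 50  # approximate concurrent FHD25 streams, per this document


def decodes_per_second(cameras, fps, decoders_per_camera):
    """Frames decoded per second across all NVDEC sessions."""
    return cameras * fps * decoders_per_camera


def nvdec_utilization(cameras, decoders_per_camera, capacity=NVDEC_STREAM_CAPACITY):
    """Fraction of the stream capacity taken by the active decode sessions."""
    return cameras * decoders_per_camera / capacity


# Without cuframes every consumer decodes every camera itself.
without = decodes_per_second(CAMERAS, FPS, CONSUMERS)
# With cuframes only the publisher decodes; consumers share its frames.
with_cuframes = decodes_per_second(CAMERAS, FPS, 1)

print(without, with_cuframes, without // with_cuframes)  # 1200 400 3
print(f"{nvdec_utilization(CAMERAS, CONSUMERS):.0%} vs "
      f"{nvdec_utilization(CAMERAS, 1):.0%}")            # 96% vs 32%
```

In this model the saving scales with the number of consumers per camera, so adding a fourth consumer would raise the decode load only in the setup without cuframes.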
## Latency

| Hop | Latency |
|---|---|
| RTSP → publisher demuxer | sub-frame (<40 ms at FHD25) |
| NVDEC decode | ~3-5 ms per frame |
| publish_external → consumer receive | **<0.5 ms** (cudaEventRecord → cudaStreamWaitEvent) |
| consumer cudaMemcpy NV12 → host (FFmpeg demuxer v1) | ~2-3 ms at FHD |
| **End-to-end RTSP → consumer frame ready** | ~50-100 ms typical |

The zero-copy path (via `AVHWFramesContext`, planned for v0.2) will remove the CPU copy,
ideally bringing end-to-end latency to `<10 ms`.

## Reproducibility

All benchmarks can be reproduced from the repo:

```bash
# Stress test
cd build && cmake -DBUILD_TESTING=ON .. && cmake --build . && ctest -R stress -V

# E2E single consumer
./tools/cuframes-rtsp-source --rtsp rtsp://... --key cam1 --ring 6 --verbose &
./examples/sub_count --key cam1 --max-frames 100 --verbose
```

For production deployment measurements, see the integration guides:
- [docs/integration.md](docs/integration.md): cctv-processor C++ pipeline
- [filter/README.md](filter/README.md): FFmpeg demuxer (Frigate setup)