diff --git a/docs/benchmarks-phase0.md b/docs/benchmarks-phase0.md
new file mode 100644
index 0000000..2817b47
--- /dev/null
+++ b/docs/benchmarks-phase0.md
@@ -0,0 +1,157 @@
+# Phase 0: Benchmark results
+
+CUDA IPC ping-pong measurements. They validate that the concept works and
+delivers sufficient performance before we invest in the full
+implementation (Phase 1+).
+
+The raw output of every run is saved in `docs/measurements/*.log`; all
+numbers below are taken from there.
+
+## Hardware
+
+- GPU: **NVIDIA RTX 5090** (Blackwell, sm_120, 32 GB VRAM)
+- Driver: 595.58.03
+- CUDA Toolkit: 13.0.88 (inside the dev container)
+- Host OS: Ubuntu 24.04 (host) / Ubuntu 24.04 (container)
+- Docker: 29.1.3, nvidia-container-runtime
+
+## Methodology
+
+- The producer allocates 2 CUDA buffers (NV12 Full HD, 1920×1080×1.5 B ≈ 3.1 MB/frame),
+  exports `cudaIpcMemHandle_t`, fills the Y plane with a monotonic pattern
+  (`seq % 256`), and publishes frames at the target FPS.
+- The consumer opens the handles with `cudaIpcOpenMemHandle`, reads the frames,
+  verifies the contents of the first Y-plane row (torn-frame check), and
+  measures wall-clock producer→consumer latency.
+
+Both run in a single Docker container (`cuframes-dev`) with `ipc: shareable` and
+`shm_size: 2gb`. Source: `tools/spike/`.
+
+## Scenario 1: Basic (1 producer × 1 consumer)
+
+Raw: `docs/measurements/phase0-consumer-basic.log`
+
+```
+producer: --fps 30 --duration 30
+consumer: --count 500
+```
+
+| Metric | Value |
+|---|---|
+| Frames received | 500 |
+| Duration | 16.63 s |
+| Effective fps | 30.06 |
+| Frames skipped (producer ahead) | **0** |
+| Torn frames (data corruption) | **0** |
+| Zero-copy verified | ✓ |
+| Latency mean | 79 µs |
+| **Latency p50** | **75 µs** |
+| **Latency p95** | **146 µs** |
+| **Latency p99** | **152 µs** |
+| Latency max | 2273 µs (cold-start outlier) |
+
+## Scenario 2: Multi-consumer (1 producer × 3 consumers)
+
+Raw: `docs/measurements/phase0-consumer-multi-c{1,2,3}.log`
+
+```
+producer: --fps 30 --duration 60
+3× consumer (in parallel): --count 300
+```
+
+| Metric | C1 | C2 | C3 |
+|---|---|---|---|
+| Frames received | 300 | 300 | 300 |
+| Effective fps | 30.17 | 30.16 | 30.16 |
+| Frames skipped | **0** | **0** | **0** |
+| Torn frames | **0** | **0** | **0** |
+| Latency mean | 152 µs | 139 µs | 146 µs |
+| Latency p50 | 80 µs | 72 µs | 73 µs |
+| Latency p95 | 145 µs | 144 µs | 144 µs |
+| **Latency p99** | **152 µs** | **151 µs** | **152 µs** |
+| Latency max | 22.3 ms | 19.6 ms | 21.2 ms (cold-start) |
+
+**Key findings:**
+- p99 latency is essentially identical across all 3 consumers (151-152 µs), which is
+  consistent with zero-copy sharing (no visible contention or per-consumer overhead).
+- Every consumer received **all** 300 of its frames with no skips.
+- Zero corruption: in these runs `cudaStreamSynchronize` on the producer side was
+  sufficient for consistency.
+
+## Acceptance criteria: RESULT
+
+| Criterion | Target | Actual | Status |
+|---|---|---|---|
+| p99 latency | < 5 ms | **152 µs (0.15 ms)** | ✅ 33× below target |
+| Multi-consumer scalability | no degradation | p99 of 151-152 µs on all 3 consumers | ✅ |
+| Torn frames | 0 | 0 | ✅ |
+| Frame skip | 0 | 0 | ✅ |
+| Memory leak (short run) | 0 | not measured | ⏳ Phase 0.5 |
+
+## Remaining Phase 0 work (deferred to Phase 1 tests)
+
+- Cross-container test: producer in one Docker container, consumer in
+  another, via `ipc:container:`. Validates the production scenario.
+- 1-hour stress run, monitoring VRAM/RSS for leaks
+- CUDA IPC event sync: try `cudaIpcEventHandle_t` instead of
+  `cudaStreamSynchronize` to overlap producer decode and consumer read
+
+These scenarios are less critical (the basic design is validated), so they move to
+Phase 1, where the real libcuframes API tests will live.
+
+## Design decision
+
+CUDA IPC sync via **`cudaStreamSynchronize`** is sufficient for v0.1.
+Latency of 75-152 µs (p50-p99) is well under the 5 ms target. There is no need
+to complicate the API with event handles at this stage.
+
+## Readiness for Phase 1
+
+✅ **GREEN**: proceeding according to the plan in `docs/architecture.md` §5.
+
+Phase 1 goals:
+- libcuframes producer/consumer C API
+- Wire protocol (Unix socket handshake)
+- N-slot ring buffer (instead of 2)
+- ACK protocol for proper backpressure
+- Unit tests + stress (24h leak detection)
+
+## Reproduction
+
+```bash
+# Build the dev container (one-time)
+docker compose -f docker/docker-compose.dev.yml up -d --build
+
+# Build the spike
+docker exec cuframes-dev bash -c "
+  cd /workspace
+  cmake -B build -S tools/spike -G Ninja
+  cmake --build build
+"
+
+# Basic
+docker exec cuframes-dev bash -c "
+  cd /workspace
+  ./build/pingpong_producer --key smoke --fps 30 --duration 30 &
+  sleep 1
+  ./build/pingpong_consumer --key smoke --count 500
+"
+
+# Multi-consumer
+docker exec cuframes-dev bash -c "
+  cd /workspace
+  ./build/pingpong_producer --key multi --fps 30 --duration 60 &
+  sleep 1
+  ./build/pingpong_consumer --key multi --count 300 > /tmp/c1.log 2>&1 &
+  ./build/pingpong_consumer --key multi --count 300 > /tmp/c2.log 2>&1 &
+  ./build/pingpong_consumer --key multi --count 300
+  wait
+"
+```
+
+## Artifacts
+
+- Source: `tools/spike/` (commit `604cffb`)
+- Container: `cuframes-dev:latest` 9 GB (commit `6962bc3`)
+- Raw measurement logs: `docs/measurements/phase0-consumer-*.log`
+- Generated 2026-05-14
diff --git a/docs/measurements/phase0-consumer-basic.log b/docs/measurements/phase0-consumer-basic.log
new file mode 100644
index 0000000..6d54006
--- /dev/null
+++ b/docs/measurements/phase0-consumer-basic.log
@@ -0,0 +1,19 @@
+[consumer] key=b1 count=500
+[consumer] connected, frame 1920x1080
+[consumer] mapped 2 slots into VRAM
+
+=== cuframes spike summary ===
+frames received: 500
+duration: 16.6312 s
+effective fps: 30.0639
+frames skipped (producer ahead): 0
+torn frames (data corrupt): 0
+
+latency producer→consumer (microseconds):
+  mean: 79 us
+  p50: 75 us
+  p95: 146 us
+  p99: 152 us
+  max: 2273 us
+
+zero-copy: ✓
diff --git a/docs/measurements/phase0-consumer-multi-c1.log b/docs/measurements/phase0-consumer-multi-c1.log
new file mode 100644
index 0000000..cfb1e8b
--- /dev/null
+++ b/docs/measurements/phase0-consumer-multi-c1.log
@@ -0,0 +1,19 @@
+[consumer] key=m1 count=300
+[consumer] connected, frame 1920x1080
+[consumer] mapped 2 slots into VRAM
+
+=== cuframes spike summary ===
+frames received: 300
+duration: 9.94438 s
+effective fps: 30.1678
+frames skipped (producer ahead): 0
+torn frames (data corrupt): 0
+
+latency producer→consumer (microseconds):
+  mean: 152 us
+  p50: 80 us
+  p95: 145 us
+  p99: 152 us
+  max: 22321 us
+
+zero-copy: ✓
diff --git a/docs/measurements/phase0-consumer-multi-c2.log b/docs/measurements/phase0-consumer-multi-c2.log
new file mode 100644
index 0000000..4c3e11e
--- /dev/null
+++ b/docs/measurements/phase0-consumer-multi-c2.log
@@ -0,0 +1,19 @@
+[consumer] key=m1 count=300
+[consumer] connected, frame 1920x1080
+[consumer] mapped 2 slots into VRAM
+
+=== cuframes spike summary ===
+frames received: 300
+duration: 9.9472 s
+effective fps: 30.1593
+frames skipped (producer ahead): 0
+torn frames (data corrupt): 0
+
+latency producer→consumer (microseconds):
+  mean: 139 us
+  p50: 72 us
+  p95: 144 us
+  p99: 151 us
+  max: 19598 us
+
+zero-copy: ✓
diff --git a/docs/measurements/phase0-consumer-multi-c3.log b/docs/measurements/phase0-consumer-multi-c3.log
new file mode 100644
index 0000000..a03d7a4
--- /dev/null
+++ b/docs/measurements/phase0-consumer-multi-c3.log
@@ -0,0 +1,19 @@
+[consumer] key=m1 count=300
+[consumer] connected, frame 1920x1080
+[consumer] mapped 2 slots into VRAM
+
+=== cuframes spike summary ===
+frames received: 300
+duration: 9.94555 s
+effective fps: 30.1642
+frames skipped (producer ahead): 0
+torn frames (data corrupt): 0
+
+latency producer→consumer (microseconds):
+  mean: 146 us
+  p50: 73 us
+  p95: 144 us
+  p99: 152 us
+  max: 21240 us
+
+zero-copy: ✓
diff --git a/docs/measurements/phase0-producer-basic.log b/docs/measurements/phase0-producer-basic.log
new file mode 100644
index 0000000..e69de29
diff --git a/docs/measurements/phase0-producer-multi.log b/docs/measurements/phase0-producer-multi.log
new file mode 100644
index 0000000..e69de29
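
The consumer summary logs in this change can be re-checked against the acceptance table (p99 < 5 ms, zero torn frames, zero skipped frames) with a short script. The sketch below is only an illustration and is not part of `tools/spike/`: the helper names `parse_summary` and `check_acceptance` are hypothetical, the embedded text is copied from `docs/measurements/phase0-consumer-basic.log`, and the parser assumes every metric line has the form `label: number`.

```python
import re

# Summary section copied from docs/measurements/phase0-consumer-basic.log.
BASIC_LOG = """\
frames received: 500
duration: 16.6312 s
effective fps: 30.0639
frames skipped (producer ahead): 0
torn frames (data corrupt): 0

latency producer→consumer (microseconds):
  mean: 79 us
  p50: 75 us
  p95: 146 us
  p99: 152 us
  max: 2273 us
"""


def parse_summary(text):
    """Hypothetical helper: map each 'label: <number>' line to a float."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*([0-9.]+)", line)
        if match:  # lines without a number after the colon are skipped
            values[match.group(1).strip()] = float(match.group(2))
    return values


def check_acceptance(values, p99_target_us=5000.0):
    """Hypothetical helper: apply the targets from the acceptance table."""
    return {
        "p99_latency": values["p99"] < p99_target_us,
        "torn_frames": values["torn frames (data corrupt)"] == 0,
        "frame_skip": values["frames skipped (producer ahead)"] == 0,
    }


if __name__ == "__main__":
    summary = parse_summary(BASIC_LOG)
    for criterion, passed in check_acceptance(summary).items():
        print(criterion, "PASS" if passed else "FAIL")
    # Headroom against the 5 ms target, as quoted in the acceptance table.
    print("headroom:", round(5000.0 / summary["p99"]), "x")
```

The same two functions could be pointed at the three multi-consumer logs to confirm that each consumer meets the targets independently; reading the files from disk instead of embedding the text is left out to keep the sketch self-contained.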