init

2026-05-26 23:23:25 +01:00
commit 7f45c36aa2
43 changed files with 25333 additions and 0 deletions
@@ -0,0 +1,7 @@
+{
+  "label": "Concepts",
+  "position": 3,
+  "collapsible": true,
+  "collapsed": false,
+  "link": null
+}
@@ -0,0 +1,118 @@
+---
+title: Frame ring vs Packet ring
+sidebar_position: 1
+---
+
+# Frame ring vs Packet ring
+
+cuframes gives a publisher two **independent** ring buffers with different semantics and different costs:
+
+- **Frame ring**: decoded NV12 (or another pixel format), shared zero-copy through CUDA VMM.
+- **Packet ring**: encoded H.264 / H.265 NAL units, shared through POSIX shared memory.
+
+These are two separate channels on the same publisher↔subscriber pair. You can use one, both, or run several consumers where some read the frame ring and others read the packet ring.
+
+## Why two channels
+
+A single publisher usually serves different classes of consumers. An AI detector and a GPU compositor want an already decoded GPU pointer (frame ring); an NVR recorder and a replay service want a compact encoded stream (packet ring). Forcing the recorder to decode only to re-encode for storage is wasted work.
+
+```mermaid
+flowchart LR
+  RTSP[RTSP camera] --> Dec[NVDEC decode]
+  Dec --> Pub[Publisher]
+  Pub -- frame ring<br/>CUDA VMM --> AI[AI inference<br/>GPU consumer]
+  Pub -- frame ring<br/>CUDA VMM --> Comp[GPU compositor<br/>CUDA filter]
+  Pub -- packet ring<br/>POSIX shm --> NVR[NVR / recorder<br/>mp4 mux]
+  Pub -- packet ring<br/>POSIX shm --> Replay[Replay / seek service]
+```
+
+## Frame ring
+
+**What:** a ring of N CUDA-allocated slots (`cuMemCreate(POSIX_FILE_DESCRIPTOR)`) that are exported to the consumer over `SCM_RIGHTS` and imported with `cuMemImportFromShareableHandle`. The consumer gets a CUDA device pointer to the same physical HBM memory as the publisher.
+
+**When to use:**
+
+- the consumer runs on the GPU and wants the data as-is (AI inference, CUDA filter, NVENC re-encode to a different codec);
+- latency is critical: between publish and consume there is only hardware coherence, with no encode/decode round trip;
+- the consumer does not want to decode on its own.
+
+**Cost:** `ring_size × frame_size` of GPU memory on the publisher. For NV12 1920×1080 with ring=4 that is ≈ 12 MiB on the publisher (VMM granularity on an RTX 5090 is 2 MiB, so each slot rounds up to 4 MiB and the real figure is ≈ 16 MiB). Consumers pay no memory: it is the same physical memory.
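+
+The arithmetic behind these numbers can be sketched in a few lines of C. This is only an illustration: `round_up` and `nv12_bytes` are hypothetical helpers, not part of cuframes, and the 2 MiB granularity is hard-coded from the figure quoted above, whereas real code would query it with `cuMemGetAllocationGranularity`.
+
+```c
+#include <stddef.h>
+#include <stdio.h>
+
+/* Hypothetical helper: round n up to the next multiple of gran (gran > 0). */
+static size_t round_up(size_t n, size_t gran) {
+    return (n + gran - 1) / gran * gran;
+}
+
+/* NV12 = full-size Y plane + half-size interleaved UV plane. */
+static size_t nv12_bytes(size_t width, size_t height) {
+    return width * height * 3 / 2;
+}
+
+int main(void) {
+    const size_t ring_size   = 4;
+    const size_t granularity = (size_t)2 * 1024 * 1024;  /* assumed, see text */
+    const size_t frame = nv12_bytes(1920, 1080);
+    const size_t slot  = round_up(frame, granularity);
+
+    printf("nominal: %zu bytes, with granularity: %zu bytes\n",
+           frame * ring_size, slot * ring_size);
+    return 0;
+}
+```
+
+Under these assumptions each 1920×1080 NV12 slot occupies two 2 MiB granules, which is where the gap between the nominal and the real footprint comes from.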
+
+API:
+
+```c
+cuframes_publisher_create(&cfg, &pub);            // the ring is allocated here
+cuframes_publisher_acquire(pub, &cuda_ptr);       // get a writable slot
+// ... NVDEC / a kernel writes into cuda_ptr ...
+cuframes_publisher_publish(pub, stream, pts_ns);
+```
+
+## Packet ring
+
+**What:** a ring of slots holding metadata (pts, dts, size, flags) plus a separate data section (8 MiB by default) in the POSIX shm segment `/dev/shm/cuframes-<key>-packets`. The publisher pushes encoded NAL units (Annex B byte stream) into it.
+
+**When to use:**
+
+- the consumer decodes on its own (FFmpeg demuxer, recorder, or no GPU memory to spare);
+- a compact stream is needed for writing to disk;
+- a late subscriber has to resync on its own from the nearest keyframe, which is part of the ring's semantics (see the [protocol reference](/docs/reference/protocol)).
+
+**Cost:** POSIX shm on the host: `data_size + ring_slots × 64 bytes`. GPU usage is zero.
+
+The packet ring is **optional and enabled separately** on an already created publisher:
+
+```c
+cuframes_publisher_create(&cfg, &pub);
+
+cuframes_packet_ring_options_t pkt_opts = {
+    .ring_slots      = 64,
+    .data_size       = 8 * 1024 * 1024,
+    .max_packet_size = 2 * 1024 * 1024,
+    .codec_id        = 27,   // AV_CODEC_ID_H264
+};
+cuframes_publisher_enable_packets(pub, &pkt_opts);
+cuframes_publisher_set_codec_extradata(pub, sps_pps, sps_pps_size);
+
+// in the publish loop:
+cuframes_publisher_publish_packet(pub, nal_data, nal_size,
+                                  pts_ns, dts_ns,
+                                  CUFRAMES_PKT_FLAG_KEY);
+```
+
+The subscriber side is analogous:
+
+```c
+cuframes_subscriber_create(&cfg, &sub);
+cuframes_subscriber_enable_packets(sub);   // opens the second SHM segment
+
+cuframes_packet_t *pkt;
+cuframes_subscriber_next_packet(sub, &pkt, -1);
+```
+
+A subscriber can enable any combination: frame ring only, packet ring only, or both. They are two independent SHM segments with different magic values.
+
+## Comparison
+
+| | Frame ring | Packet ring |
+|---|---|---|
+| Contents | Decoded NV12 / RGB / etc. | Encoded H.264 / H.265 NAL |
+| Transport | CUDA VMM + POSIX FD | POSIX shm |
+| Sync mechanism | atomic seq + `cuStreamSynchronize` | atomic seq (no CUDA) |
+| Latency publish→consume | sub-frame, no encode round trip | sub-frame, but the consumer decodes |
+| Memory cost (publisher) | `ring_size × frame_size` GPU | `data_size` host shm |
+| Memory cost (consumer) | 0 (shared physical pages) | 0 (mmap same shm) |
+| Requires CUDA on the consumer | yes | no |
+| Late join semantics | starts at the newest frame | resync on the last keyframe |
+| Typical use case | AI inference, GPU compositor | NVR recording, replay |
+
+## Can you use one without the other
+
+Yes, the frame ring works without the packet ring. The frame ring is allocated in `cuframes_publisher_create`; without it a publisher does not exist at all. The packet ring is optional: if `cuframes_publisher_enable_packets` is never called, the publisher simply rejects `publish_packet`, and a subscriber calling `enable_packets` gets `CUFRAMES_ERR_NOT_FOUND`.
+
+The reverse (packet ring without frame ring) is not supported by the current API; pure encoded-only scenarios are a TODO for a future version.
+
+## Next
+
+- [Ownership modes](/docs/concepts/ownership-modes): how to choose who allocates the ring.
+- [Synchronization](/docs/concepts/sync-vmm-stream): why the frame ring synchronizes through stream sync rather than CUDA events.
+- [First publisher](/docs/getting-started/first-publisher): a working C example without the packet ring.
@@ -0,0 +1,119 @@
+---
+title: Ownership modes
+sidebar_position: 2
+---
+
+# Ownership modes
+
+Who owns the CUDA memory that the publisher writes frames into: the library itself or external code. The `cuframes.h` header declares both options:
+
+```c
+typedef enum cuframes_ownership_mode {
+    CUFRAMES_OWNERSHIP_LIBRARY  = 0,
+    CUFRAMES_OWNERSHIP_EXTERNAL = 1,
+} cuframes_ownership_mode_t;
+```
+
+In v0.4, however, **only `LIBRARY` works**. `EXTERNAL` stays in the API for binary compatibility and is marked deprecated. The rest of this page explains why, and what to do if your code used EXTERNAL before.
+
+## LIBRARY mode (the only working mode in v0.4)
+
+The publisher asks the library to allocate a ring of the given size. For every frame the publisher acquires a free slot, writes into it, and hands it back through publish.
+
+```c
+cuframes_publisher_config_t cfg = {
+    .key       = "cam1",
+    .width     = 1920,
+    .height    = 1080,
+    .format    = CUFRAMES_FORMAT_NV12,
+    .ownership = CUFRAMES_OWNERSHIP_LIBRARY,
+    .ring_size = 4,
+    .policy    = CUFRAMES_POLICY_DROP_OLDEST,
+};
+cuframes_publisher_t *pub;
+cuframes_publisher_create(&cfg, &pub);
+
+void *slot;
+cuframes_publisher_acquire(pub, &slot);
+// NVDEC / cuMemcpy / a kernel writes into slot
+cuframes_publisher_publish(pub, stream, cuframes_now_ns());
+```
+
+Under the hood the library:
+
+1. Allocates `ring_size` slots with `cuMemCreate(CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR)`.
+2. Calls `cuMemMap` + `cuMemSetAccess` so the local publisher can write.
+3. On subscribe, passes the POSIX FD through `sendmsg(SCM_RIGHTS)`.
+4. The subscriber calls `cuMemImportFromShareableHandle` and gets a pointer to the same physical HBM.
+
+This path is zero-copy for consumers. Memory overhead on the publisher is `ring_size × frame_size`; on the consumer it is zero.
+
+## EXTERNAL mode (deprecated in v0.4)
+
+The idea was that the publisher already holds CUDA pointers from someone else's pool (FFmpeg `AVHWFramesContext`, NVDEC output, DeepStream, a custom decoder) and just wants to share them, with no extra allocation and no D2D copy.
+
+```c
+// This worked in v0.1–0.3. It does not work in v0.4.
+cuframes_publisher_create_external(&cfg, ffmpeg_pool_ptrs,
+                                   pool_size, frame_size, &pub);
+```
+
+In v0.4 this path is broken by design.
+
+### Why EXTERNAL does not work with VMM
+
+cuframes v0.4 publishes frames through a POSIX FD because that is the only CUDA IPC channel that **does not require a shared PID namespace** (see the [memory note on v0.4](/docs/intro) and the [sync writeup](/docs/concepts/sync-vmm-stream)). The FD comes from `cuMemExportToShareableHandle`, and that function **requires** the memory to have been allocated through `cuMemCreate` with the matching `requestedHandleTypes`.
+
+An existing `cudaMalloc` / `cudaMallocPitch` pointer (the kind FFmpeg or DeepStream hands out) is not VMM memory, so there is no way to export it as a POSIX FD. The old v0.3 path used `cudaIpcGetMemHandle` (an opaque 64-byte structure passed in the socket payload); it worked with any memory but required a shared PID namespace. v0.4 dropped it deliberately.
+
+### What to do instead
+
+If you already have a GPU pool from FFmpeg / NVDEC / etc., switch to LIBRARY mode with one extra device-to-device copy:
+
+```c
+// FFmpeg delivers the frame in its hwframe pool:
+AVFrame *src = ...;            // src->data[0] = cudaMalloc'd by FFmpeg
+
+void *slot;
+cuframes_publisher_acquire(pub, &slot);
+
+// One pitched DtoD copy (assumes the UV plane directly follows Y with the same pitch):
+cuMemcpy2DAsync(&(CUDA_MEMCPY2D){
+    .srcMemoryType = CU_MEMORYTYPE_DEVICE,
+    .srcDevice = (CUdeviceptr)src->data[0],
+    .srcPitch  = src->linesize[0],
+    .dstMemoryType = CU_MEMORYTYPE_DEVICE,
+    .dstDevice = (CUdeviceptr)slot,
+    .dstPitch  = pub_pitch,
+    .WidthInBytes = width,
+    .Height       = height_y + height_uv,
+}, stream);
+
+cuframes_publisher_publish(pub, stream, pts_ns);
+```
+
+This is how the `cuframes-rtsp-source` tool in the cuframes repository was migrated: it used to accept the FFmpeg pool through EXTERNAL, and now does acquire + 1 D2D copy. The overhead is small: a single DtoD copy of a 1920×1080 NV12 frame takes tens of microseconds, which is noise next to `cuStreamSynchronize`.
+
+### Memory trade-off
+
+| | LIBRARY (v0.4) | EXTERNAL (v0.3, deprecated) |
+|---|---|---|
+| Publisher extra alloc | `ring_size × frame_size` | 0 |
+| D2D copy per frame | 1 (if there is an upstream pool) or 0 (if the decoder writes straight into the slot) | 0 |
+| Zero-copy for consumers | yes | yes |
+| Works without a shared PID namespace | yes | no |
+| Supported in v0.4 | yes | **no** |
+
+If the decoder can be taught to write straight into the slot (`acquire` first, then decode into the returned pointer), the extra D2D copy disappears. `cuframes-rtsp-source` does this with its NVDEC pipeline.
+
+## Will EXTERNAL come back
+
+If NVIDIA adds a way to export `cudaMalloc` memory as a POSIX FD, yes: that would restore the zero-D2D path without giving up cross-namespace operation. As of CUDA 12.4 there is no such API, and NVIDIA has not announced one on its roadmap. In practice, do not count on it.
+
+The `ownership` field in `cuframes_publisher_config_t` stays for ABI stability. Passing `CUFRAMES_OWNERSHIP_EXTERNAL` in v0.4 returns `CUFRAMES_ERR_INVALID_ARG`. `cuframes_publisher_create_external` is still declared in the header but returns the same error.
+
+## Next
+
+- [Synchronization](/docs/concepts/sync-vmm-stream): why v0.4 moved away from CUDA events and how that relates to the same VMM limitation.
+- [First publisher](/docs/getting-started/first-publisher): a working LIBRARY-mode example.
+- [Protocol reference](/docs/reference/protocol): wire format of the VMM_FDS handshake.
@@ -0,0 +1,121 @@
+---
+title: "Synchronization: stream sync, not CUDA events"
+sidebar_position: 3
+---
+
+# Synchronization: stream sync, not CUDA events
+
+A producer and a consumer in different processes need a mechanism that guarantees that, by the time the consumer starts reading a slot, **all of the producer's GPU writes to that slot have landed in HBM**. Before v0.4 this was the job of CUDA IPC events. Since v0.4 it is `cuStreamSynchronize` plus atomic ordering. The change is not cosmetic, and this page explains why.
+
+## What v0.3 did: CUDA IPC events
+
+On every publish the producer called `cudaEventRecord` on its stream. The event handle (`cudaIpcEventHandle_t`) was exported once at startup and shared with all subscribers. For every frame a subscriber called `cudaStreamWaitEvent` on its own stream, and the GPU scheduler itself waited for the producer's record to complete before letting the DtoD copy into the queue.
+
+Advantage: the CPU does not block. The producer enqueues work and moves on; the waiting happens in the GPU command queue.
+
+Disadvantage: **CUDA IPC events require a shared PID namespace between the processes**, exactly as `cudaIpcOpenMemHandle` did. The NVIDIA Driver API exports an event handle only through that same legacy IPC mechanism, which has no POSIX FD counterpart. `CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR` exists for memory, but not for events.
+
+This is the limitation that bit us on Frigate. Frigate under s6-overlay does not share its PID namespace, and a `cudaIpcOpenEventHandle` attempt without a shared PID namespace fails silently, leaving the subscriber hanging until the first frame times out. See the [memory feedback on PID sharing](/docs/intro).
+
+## What v0.4 does instead
+
+Before publishing the seq, the producer calls **`cuStreamSynchronize`** on the stream the GPU data was written on. This is a blocking CPU operation: the function returns only once all pending writes on that stream have completed. Only then does it do the atomic store into the shared header.
+
+From `producer.c::do_publish` (v0.4):
+
+```c
+static int do_publish(cuframes_publisher_t *pub, int32_t slot,
+                      void *stream, int64_t pts_ns)
+{
+    /* 1. wait for this stream's GPU writes */
+    cuStreamSynchronize((CUstream)stream);
+
+    /* 2. reset the ack bitmap for the new seq */
+    atomic_store_explicit(&pub->hdr->slots[slot].ack_bitmap, 0,
+                          memory_order_release);
+    atomic_store_explicit(&pub->hdr->slots[slot].pts_ns, pts_ns,
+                          memory_order_release);
+
+    /* 3. publish slot.seq; from here on the consumer can see it */
+    atomic_store_explicit(&pub->hdr->slots[slot].seq, pub->next_seq,
+                          memory_order_release);
+
+    /* 4. publish global_seq: the wake-up for polling consumers */
+    atomic_store_explicit(&pub->hdr->global_seq, pub->next_seq,
+                          memory_order_release);
+    return 0;
+}
+```
+
+The consumer on the other side (`consumer.c`) does an acquire-load, reads the slot, and **re-checks seq** after it has resolved the pointer. This guards against the producer overwriting the slot between the read of `global_seq` and the actual copy:
+
+```c
+uint64_t gs = atomic_load_explicit(&sub->hdr->global_seq, memory_order_acquire);
+// ... find slot_idx from gs ...
+uint64_t slot_seq = atomic_load_explicit(&sub->hdr->slots[slot_idx].seq,
+                                          memory_order_acquire);
+int64_t pts = atomic_load_explicit(&sub->hdr->slots[slot_idx].pts_ns,
+                                    memory_order_acquire);
+
+/* v0.4: the producer already ran cuStreamSynchronize before the atomic_store of seq.
+ * The verify_seq post-check guards against the producer overwriting the slot. */
+uint64_t verify_seq = atomic_load_explicit(&sub->hdr->slots[slot_idx].seq,
+                                            memory_order_acquire);
+```
+
+After the post-check the consumer issues a DtoD memcpy on its own stream. On a single GPU **hardware coherence is guaranteed**: there is one HBM and a shared L2 cache, so after the producer's `cuStreamSynchronize` all of its writes are already in L2/HBM, and any subsequent kernel or copy on the same GPU sees them.
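+
+The publish and re-check ordering can be tried out without a GPU. The following is a minimal single-process sketch with C11 atomics, not the library's code: `header_t`, `slot_t`, `publish` and `read_newest` are hypothetical names, a small byte array stands in for the GPU slot, and it assumes that seq starts at 1 and that the slot index is `seq % RING_SIZE` (the excerpt above elides how `slot_idx` is derived).
+
+```c
+#include <stdatomic.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <string.h>
+
+#define RING_SIZE  4
+#define SLOT_BYTES 16                     /* stand-in for a GPU slot */
+
+typedef struct {
+    _Atomic uint64_t seq;
+    _Atomic int64_t  pts_ns;
+    uint8_t          data[SLOT_BYTES];
+} slot_t;
+
+typedef struct {
+    _Atomic uint64_t global_seq;
+    slot_t           slots[RING_SIZE];
+} header_t;
+
+/* Producer: finish the payload first, then publish seq with release stores. */
+static void publish(header_t *h, uint64_t seq, int64_t pts_ns, uint8_t fill) {
+    slot_t *s = &h->slots[seq % RING_SIZE];
+    memset(s->data, fill, SLOT_BYTES);    /* plays the role of completed GPU writes */
+    atomic_store_explicit(&s->pts_ns, pts_ns, memory_order_release);
+    atomic_store_explicit(&s->seq, seq, memory_order_release);
+    atomic_store_explicit(&h->global_seq, seq, memory_order_release);
+}
+
+/* Consumer: returns 1 and fills the outputs, or 0 if nothing was published
+ * or the slot no longer holds the sequence number that was announced. */
+static int read_newest(header_t *h, uint64_t *seq, int64_t *pts_ns, uint8_t *out) {
+    uint64_t gs = atomic_load_explicit(&h->global_seq, memory_order_acquire);
+    if (gs == 0) return 0;                /* assumption: 0 means "no frame yet" */
+    slot_t *s = &h->slots[gs % RING_SIZE];
+    uint64_t slot_seq = atomic_load_explicit(&s->seq, memory_order_acquire);
+    int64_t  pts      = atomic_load_explicit(&s->pts_ns, memory_order_acquire);
+    uint64_t verify   = atomic_load_explicit(&s->seq, memory_order_acquire);
+    if (slot_seq != gs || verify != slot_seq) return 0;   /* slot was reused */
+    memcpy(out, s->data, SLOT_BYTES);     /* the DtoD copy in the real pipeline */
+    *seq = gs;
+    *pts_ns = pts;
+    return 1;
+}
+
+int main(void) {
+    static header_t h;                    /* zero-initialised */
+    uint64_t seq; int64_t pts; uint8_t buf[SLOT_BYTES];
+
+    for (uint64_t i = 1; i <= 6; i++)     /* wraps the 4-slot ring */
+        publish(&h, i, (int64_t)i * 40000000, (uint8_t)i);
+
+    if (read_newest(&h, &seq, &pts, buf))
+        printf("seq=%llu pts_ns=%lld byte0=%u\n",
+               (unsigned long long)seq, (long long)pts, (unsigned)buf[0]);
+    return 0;
+}
+```
+
+A caller that gets 0 back would simply poll `global_seq` again. The sketch mirrors the order of the excerpt (re-check, then copy); whether an additional check after the copy is needed depends on ring depth and publish rate, which this example does not model.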
+
+## Sequence diagram
+
+```mermaid
+sequenceDiagram
+    participant PApp as Publisher app
+    participant PStream as Publisher CUDA stream
+    participant HBM as GPU HBM (shared)
+    participant Header as SHM header (atomic)
+    participant CApp as Consumer app
+    participant CStream as Consumer CUDA stream
+
+    PApp->>PStream: kernel / NVDEC writes
+    Note over PStream,HBM: async — pending in stream
+    PApp->>PStream: cuStreamSynchronize
+    PStream-->>HBM: all writes flushed
+    PStream-->>PApp: return (CPU unblocked)
+    PApp->>Header: atomic_store(seq, release)
+
+    loop poll
+        CApp->>Header: atomic_load(global_seq, acquire)
+    end
+    CApp->>Header: read pts, slot_seq
+    CApp->>Header: atomic_load(slot_seq, acquire)  [verify]
+    CApp->>CStream: cuMemcpyDtoDAsync(slot → dst)
+    CStream->>HBM: read — sees flushed data
+    CStream-->>CApp: copy enqueued
+```
+
+## Trade-offs
+
+| | v0.3 (CUDA IPC events) | v0.4 (stream sync) |
+|---|---|---|
+| Cross-namespace (no shared PID namespace) | **no** | yes |
+| CPU blocking on publish | no | yes, ~1 ms |
+| GPU command queue ordering | automatic | manual (acquire/release) |
+| Extra race without the post-check | no | yes, guarded by verify_seq |
+| Dependency on a CUDA feature | `cudaIpcGetEventHandle` | `cuStreamSynchronize` (always available) |
+| Compatibility with s6-overlay / Frigate | broken | works |
+
+`cuStreamSynchronize` costs on the order of a millisecond (depending on how much work is pending on the stream). For a 25 fps publisher that is ≈ 25 ms per second, or ≈ 2.5% of the publisher thread's time: noticeable, but not critical for real-time CCTV. If that is too expensive for your scenario, polling with `cuEventQuery` is a possible optimization, but v0.4 does not do that yet (a plain sync is simpler, easier to get right, and cheap enough).
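+
+To see how this share scales with frame rate, the estimate can be written out as a tiny C program. It is a back-of-the-envelope sketch: `sync_share` is a hypothetical helper, and the 1 ms per publish is the rough figure from the text, not a measurement.
+
+```c
+#include <stddef.h>
+#include <stdio.h>
+
+/* Fraction of each second the publisher thread spends blocked in the sync,
+ * given the frame rate and an assumed per-publish sync cost in milliseconds. */
+static double sync_share(double fps, double sync_ms) {
+    return fps * sync_ms / 1000.0;
+}
+
+int main(void) {
+    const double rates[] = {25.0, 30.0, 60.0};
+    for (size_t i = 0; i < sizeof rates / sizeof rates[0]; i++)
+        printf("%.0f fps -> %.1f%% of the publisher thread\n",
+               rates[i], 100.0 * sync_share(rates[i], 1.0));
+    return 0;
+}
+```
+
+The share grows linearly with both the frame rate and the sync cost, so a stream with more pending work per frame pays proportionally more.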
+
+Dropping events is not a matter of better or worse; it changes where the library can be used. v0.3 did not work under s6/Frigate. v0.4 does, at the cost of ~1 ms of CPU per publish.
+
+## What this means for developers
+
+- Pass `cuframes_publisher_publish` **the same stream** the data was written on. Otherwise `cuStreamSynchronize` waits on someone else's writes (in the worst case, none at all) and the data race is back.
+- `stream = NULL` (the default stream) is allowed, but the default stream serializes with the whole GPU context, which is usually slower than a dedicated stream.
+- A consumer must not rely on CUDA event sync; it no longer exists. Stream sync on the producer plus atomic ordering on the consumer replace the entire old IPC event machinery.
+
+## Next
+
+- [Frame ring vs Packet ring](/docs/concepts/frame-vs-packet-ring): packet ring sync is simpler, there is no CUDA involved at all.
+- [Ownership modes](/docs/concepts/ownership-modes): why the VMM limitation removed both EXTERNAL and events at the same time.
+- [Protocol reference](/docs/reference/protocol): the exact layout of the shared header and its atomic fields.
@@ -0,0 +1,105 @@
+---
+title: FAQ
+sidebar_position: 99
+---
+
+# Frequently Asked Questions
+
+## Is cuframes production-ready?
+
+Honest answer: **early but tested in one real deployment**. v0.4 has been running 24+ hours in a CCTV setup (4 IP cameras → publisher → 2 subscribers: NVR record + grid composer → TV) without intervention. That's not "battle-tested at scale" — that's "one engineer's home setup".
+
+We recommend cuframes if:
+
+- You're building your own video pipeline and understand the risks of adopting an OSS library at v0.x.
+- Your team can read the source if something breaks (it's ~2k lines of C).
+- You'd rather pin a known commit than chase semver promises.
+
+We don't recommend cuframes if:
+
+- You need vendor support contracts.
+- You're shipping to customers who'd file an incident on the maintainer's email at 3 AM.
+- You can't afford to write a workaround if a feature lands wrong.
+
+## How does cuframes compare to DeepStream?
+
+|                       | cuframes                  | NVIDIA DeepStream         |
+| --------------------- | ------------------------- | ------------------------- |
+| Scope                 | Library (data plane only) | Full SDK + runtime        |
+| License               | LGPL-2.1+                 | Proprietary EULA          |
+| Footprint             | ~140 KB `.so`             | Multi-GB runtime          |
+| Lock-in               | None — you own the pipeline | Pipeline = DeepStream plugins |
+| Cross-process sharing | Native (this is the point) | Inside one process tree   |
+| Support               | Best-effort GitHub        | Paid enterprise           |
+| Learning curve        | Hours                     | Weeks                     |
+
+cuframes is **not** trying to replace DeepStream. It solves one specific problem: "I have one CUDA decoder and multiple processes that want the decoded frames without re-decoding."
+
+## Why not GStreamer?
+
+GStreamer has `cudaupload` / `cudadownload` elements but no zero-copy cross-process model — every consumer pulls its own pipeline. You can hack around with `shmsink` / `shmsrc` but you lose CUDA-residency (frames bounce through CPU memory). cuframes specifically avoids that round trip.
+
+## Why not DMA-BUF + V4L2?
+
+That's the modern kernel-native path and works cross-vendor. We considered it. Reasons we went with CUDA VMM:
+
+- Our target was NVIDIA-only anyway (existing CUDA decode pipeline).
+- DMA-BUF integration with CUDA requires `EGLStream` interop boilerplate — more code than VMM + POSIX FD path.
+- Driver support varies by GPU age; CUDA VMM is stable since CUDA 10.2.
+
+If your project is cross-vendor, DMA-BUF is the right choice and cuframes is not for you.
+
+## Can I use this on Windows?
+
+No. Implementation uses POSIX shared memory (`shm_open`), Unix sockets, and `SCM_RIGHTS` file descriptor passing. Porting to Windows would require:
+
+- Windows shared memory primitives (`CreateFileMapping`).
+- Different FD-sharing mechanism (`DuplicateHandle` via named pipe).
+- CUDA VMM `WIN32_HANDLE` instead of `POSIX_FILE_DESCRIPTOR`.
+
+Not on the roadmap. PRs welcome.
+
+## Can publisher and consumer be on different machines?
+
+No. POSIX file descriptors don't traverse a network. For cross-host video sharing you need a different transport: RTSP, SRT, NDI, or your own protocol on top of RDMA. cuframes is strictly same-host.
+
+## What if the publisher dies while consumer is reading?
+
+Consumer's next `cuframes_subscriber_next()` returns `CUFRAMES_ERR_DISCONNECTED`. Consumer should:
+
+1. Call `cuframes_subscriber_destroy()`.
+2. Wait (e.g., 1–2 sec backoff).
+3. Try `cuframes_subscriber_create()` again with the same key.
+
+The FFmpeg demuxer (`cuframes://`) does this automatically — every 2 seconds re-subscribes and returns `EAGAIN` to the avformat layer instead of `EOF`. See `libavformat/cuframesdec.c` if you're reimplementing this for another framework.
+
+## Can I have multiple publishers on the same key?
+
+No. Each key (`/run/cuframes/<key>.sock` + `/dev/shm/cuframes-<key>`) maps to exactly one publisher. The publisher detects "already running" on `create()` via shm header + PID liveness check and fails with `CUFRAMES_ERR_ALREADY_EXISTS`.
+
+For load balancing or HA scenarios you'd need to layer your own naming scheme on top (e.g., `cam1-primary`, `cam1-backup`, and consumer logic to pick which to subscribe to).
+
+## How many subscribers can a publisher handle?
+
+`CUFRAMES_MAX_SUBSCRIBERS = 32` (bitmap-limited). Bumping this requires a protocol version change because the bitmap is in the SHM header.
+
+In practice we run 2–3 subscribers per camera (NVR + AI inference + grid composer). 32 is more than enough.
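+
+Why a bitmap caps the subscriber count can be shown with a short sketch. It assumes one bit per subscriber in a 32-bit word; the real header layout is not shown in this FAQ, and `claim_index`, `ack` and `all_acked` are hypothetical names rather than cuframes API.
+
+```c
+#include <stdint.h>
+#include <stdio.h>
+
+#define MAX_SUBSCRIBERS 32   /* same value as CUFRAMES_MAX_SUBSCRIBERS */
+
+/* Claim the lowest free subscriber index, or return -1 when all 32 are taken. */
+static int claim_index(uint32_t *live) {
+    for (int i = 0; i < MAX_SUBSCRIBERS; i++) {
+        uint32_t bit = UINT32_C(1) << i;
+        if ((*live & bit) == 0) {
+            *live |= bit;
+            return i;
+        }
+    }
+    return -1;
+}
+
+/* Mark subscriber idx as having acknowledged the current slot. */
+static void ack(uint32_t *acks, int idx) {
+    *acks |= UINT32_C(1) << idx;
+}
+
+/* A slot is reusable once every live subscriber has acknowledged it. */
+static int all_acked(uint32_t acks, uint32_t live) {
+    return (acks & live) == live;
+}
+
+int main(void) {
+    uint32_t live = 0, acks = 0;
+    int nvr = claim_index(&live);
+    int ai  = claim_index(&live);
+    ack(&acks, nvr);
+    printf("after NVR ack: all_acked=%d\n", all_acked(acks, live));
+    ack(&acks, ai);
+    printf("after AI ack:  all_acked=%d\n", all_acked(acks, live));
+    return 0;
+}
+```
+
+With this layout a 33rd subscriber has no bit to claim, and widening the word changes the size of the shared header, which is why the limit is tied to the protocol version.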
+
+## License questions
+
+LGPL-2.1+. You can use cuframes in commercial closed-source products as long as:
+
+- Dynamic linking (the `.so` is replaceable by the end user).
+- Any modifications to `libcuframes` itself are published under LGPL.
+
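+Static linking is possible under the LGPL only if you also ship what the end user needs to relink your application against a modified `libcuframes` (object files or source), which is usually not what you want.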
+
+If LGPL is incompatible with your use case (e.g., embedded with no replaceable library), reach out before forking.
+
+## Where do I report bugs / contribute?
+
+- Source repo: https://git.goldix.org/gx/cuframes
+- Issues: same repo `/issues`
+- Contributing guide: `CONTRIBUTING.md` in the source tree
+
+This documentation site lives separately at https://git.goldix.org/gx/cuframes-docs — typo fixes and content PRs land via the same Gitea.
@@ -0,0 +1,7 @@
+{
+  "label": "Getting started",
+  "position": 2,
+  "collapsible": true,
+  "collapsed": false,
+  "link": null
+}
@@ -0,0 +1,110 @@
+---
+title: First publisher
+sidebar_position: 2
+---
+
+# First publisher
+
+A minimal publisher that exposes a CUDA-resident ring of 4 NV12 frames at 1920×1080 and writes 10 frames into it. Each frame is filled with a single-byte pattern via `cudaMemsetAsync`, so a subscriber can later verify the contents end-to-end.
+
+This is a stripped-down version of [`spike/smoke_v04/smoke_pub.c`](https://git.goldix.org/gx/cuframes/src/branch/main/spike/smoke_v04/smoke_pub.c) in the cuframes repo.
+
+## Source
+
+```c
+/* first_publisher.c — publish 10 NV12 1920x1080 frames, then exit. */
+#include <cuframes/cuframes.h>
+#include <cuda_runtime.h>
+#include <stdio.h>
+#include <time.h>
+
+int main(int argc, char **argv) {
+    const char *key = argc > 1 ? argv[1] : "mykey";
+
+    cuframes_publisher_config_t cfg = {0};
+    cfg.key         = key;
+    cfg.width       = 1920;
+    cfg.height      = 1080;
+    cfg.format      = CUFRAMES_FORMAT_NV12;
+    cfg.ownership   = CUFRAMES_OWNERSHIP_LIBRARY;
+    cfg.ring_size   = 4;
+    cfg.policy      = CUFRAMES_POLICY_DROP_OLDEST;
+    cfg.cuda_device = 0;
+
+    cuframes_publisher_t *pub = NULL;
+    int r = cuframes_publisher_create(&cfg, &pub);
+    if (r != CUFRAMES_OK) {
+        fprintf(stderr, "create: %s\n", cuframes_strerror(r));
+        return 1;
+    }
+
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+
+    for (int i = 0; i < 10; i++) {
+        void *ptr = NULL;
+        if ((r = cuframes_publisher_acquire(pub, &ptr)) != CUFRAMES_OK) break;
+
+        /* NV12 = Y plane + interleaved UV plane = width*height*3/2 bytes */
+        cudaMemsetAsync(ptr, (uint8_t)i, 1920 * 1080 * 3 / 2, stream);
+
+        r = cuframes_publisher_publish(pub, stream, cuframes_now_ns());
+        if (r != CUFRAMES_OK) break;
+
+        struct timespec ts = {.tv_nsec = 40000000};  /* 25 fps */
+        nanosleep(&ts, NULL);
+    }
+
+    cudaStreamDestroy(stream);
+    cuframes_publisher_destroy(pub);
+    return r == CUFRAMES_OK ? 0 : 1;
+}
+```
+
+## Walk-through
+
+**`cuframes_publisher_config_t cfg = {0};`** — always zero-initialise. The struct has a `_reserved[4]` field that must stay zero for forward ABI compatibility.
+
+**`cfg.key = "mykey"`** — uniquely names the publisher within the host. It becomes the path component of the Unix socket (`/run/cuframes/mykey.sock`) and of the POSIX SHM segment (`/dev/shm/cuframes-mykey`). Two publishers cannot share a key — the second one gets `CUFRAMES_ERR_ALREADY_EXISTS`.
+
+**`cfg.format = CUFRAMES_FORMAT_NV12`** plus `width`/`height` — frame geometry is fixed for the lifetime of the publisher. Subscribers see exactly these dimensions.
+
+**`cfg.ownership = CUFRAMES_OWNERSHIP_LIBRARY`**: the library allocates the CUDA ring buffer itself. This is the only mode that works in v0.4; the alternative, `CUFRAMES_OWNERSHIP_EXTERNAL` (pre-allocated device pointers, typically from an FFmpeg `AVHWFramesContext` pool), is deprecated and returns `CUFRAMES_ERR_INVALID_ARG`. For details see [Concepts → Ownership modes](/docs/concepts/ownership-modes).
+
+**`cfg.ring_size = 4`** — number of frame slots. 2 is the minimum, 4 a reasonable default, 16 the cap. With `DROP_OLDEST` policy a slow consumer simply misses frames; the publisher never blocks.
+
+**`cuframes_publisher_acquire(pub, &ptr)`** — returns a CUDA device pointer to the next writable slot. Valid only until the matching `publish()` call.
+
+**`cudaMemsetAsync(ptr, ..., stream)`**: fill the frame on a CUDA stream of your choice. You do **not** have to synchronize this stream yourself before publishing: in v0.4 `publish()` calls `cuStreamSynchronize` on the stream you pass in and only then makes the slot visible, so pass the same stream you wrote on. This is the cross-process synchronization contract; see [Concepts → Synchronization](/docs/concepts/sync-vmm-stream).
+
+**`cuframes_publisher_publish(pub, stream, pts_ns)`** — make the slot visible to subscribers. The `pts_ns` is opaque to the library; the recommended source is `cuframes_now_ns()` (CLOCK_MONOTONIC in nanoseconds).
+
+**Cleanup** — `cuframes_publisher_destroy()` closes the socket, unlinks the SHM segment and releases the CUDA pool.
+
+## Compile
+
+```bash
+gcc -O2 -I/usr/local/include -I/usr/local/cuda/include \
+    -o first_publisher first_publisher.c \
+    -L/usr/local/lib -lcuframes \
+    -L/usr/local/cuda/lib64 -lcudart -lcuda
+```
+
+If you built cuframes without `cmake --install`, point `-I` and `-L` at your `build/` tree (`-I./include -L./build/libcuframes`).
+
+## Run
+
+```bash
+./first_publisher mykey
+```
+
+While running, the process owns:
+
+- `/run/cuframes/mykey.sock` — the handshake / control socket
+- `/dev/shm/cuframes-mykey` — the shared metadata header (SHM)
+
+Both are removed on clean shutdown. If the publisher crashes, stale files may remain; the next start re-creates them.
+
+## Next
+
+Open a second terminal and wire up a [First subscriber](./first-subscriber.md) that reads these frames and validates the pattern. For the full API surface see [Reference → C API](/docs/reference/api-c).
@@ -0,0 +1,145 @@
+---
+title: First subscriber
+sidebar_position: 3
+---
+
+# First subscriber
+
+A minimal subscriber that connects to the publisher from [First publisher](./first-publisher.md), reads 10 frames, and checks that the first 1 KiB of each frame holds a single repeated byte, as written by the publisher.
+
+This is a stripped-down version of [`spike/smoke_v04/smoke_sub.c`](https://git.goldix.org/gx/cuframes/src/branch/main/spike/smoke_v04/smoke_sub.c).
+
+## Source
+
+```c
+/* first_subscriber.c — connect, read 10 frames, verify pattern. */
+#include <cuframes/cuframes.h>
+#include <cuda_runtime.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+int main(int argc, char **argv) {
+    const char *key = argc > 1 ? argv[1] : "mykey";
+
+    cuframes_subscriber_config_t cfg = {0};
+    cfg.key                = key;
+    cfg.consumer_name      = "first-sub";
+    cfg.mode               = CUFRAMES_MODE_NEWEST_ONLY;
+    cfg.cuda_device        = 0;
+    cfg.connect_timeout_ms = 5000;
+
+    cuframes_subscriber_t *sub = NULL;
+    int r = cuframes_subscriber_create(&cfg, &sub);
+    if (r != CUFRAMES_OK) {
+        fprintf(stderr, "create: %s\n", cuframes_strerror(r));
+        return 1;
+    }
+
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+
+    const size_t sample = 1024;             /* check first 1 KiB of each frame */
+    uint8_t *host = malloc(sample);
+
+    int frames = 0, good = 0;
+    while (frames < 10) {
+        cuframes_frame_t *f = NULL;
+        r = cuframes_subscriber_next(sub, stream, &f, 2000);
+        if (r != CUFRAMES_OK) {
+            fprintf(stderr, "next: %s\n", cuframes_strerror(r));
+            break;
+        }
+
+        cudaMemcpyAsync(host, cuframes_frame_cuda_ptr(f), sample,
+                        cudaMemcpyDeviceToHost, stream);
+        cudaStreamSynchronize(stream);
+
+        int mismatch = 0;
+        for (size_t i = 1; i < sample; i++)
+            if (host[i] != host[0]) mismatch++;
+        if (mismatch == 0) good++;
+
+        printf("seq=%lu pts_ns=%lld pitch_y=%d byte0=0x%02x mismatch=%d\n",
+               (unsigned long)cuframes_frame_seq(f),
+               (long long)cuframes_frame_pts_ns(f),
+               cuframes_frame_pitch_y(f),
+               host[0], mismatch);
+
+        cuframes_subscriber_release(sub, f);
+        frames++;
+    }
+
+    free(host);
+    cudaStreamDestroy(stream);
+    cuframes_subscriber_destroy(sub);
+    return (good == frames && frames > 0) ? 0 : 1;
+}
+```
+
+## Walk-through
+
+**`cfg.key`** — must match the publisher's key exactly. The subscriber finds the publisher by `connect()`-ing to `/run/cuframes/<key>.sock`.
+
+**`cfg.consumer_name = "first-sub"`** — identifies this subscriber inside the publisher's ACK bitmap. It must be unique among live subscribers of the same publisher; a collision returns `CUFRAMES_ERR_ALREADY_EXISTS`. If you pass `NULL` the library generates `subscriber-<pid>-<random>`. A publisher accepts up to 32 simultaneous subscribers.
+
+**`cfg.mode = CUFRAMES_MODE_NEWEST_ONLY`** — the subscriber always jumps to the latest published frame and skips any frames the publisher produced while the previous `next()` call was being processed. Use `CUFRAMES_MODE_STRICT_ORDER` if you must see every frame in seq order; in that mode a ring overflow surfaces as `CUFRAMES_ERR_DISCONNECTED`.
+
+**`cfg.connect_timeout_ms = 5000`** — how long `create()` waits for the publisher to exist. `0` fails immediately with `CUFRAMES_ERR_NOT_FOUND`, `-1` waits forever.
+
+**`cfg.cuda_device`** — must equal the publisher's `cuda_device`. The frame slots are allocated on the publisher's GPU, and the exported handles are not portable across devices.
+
+**`cuframes_subscriber_next(sub, stream, &f, 2000)`** — blocks for up to 2 s waiting for the next frame. The library internally issues `cudaStreamWaitEvent` on your `stream` against the publisher's record-event, so any kernel you launch on `stream` after `next()` returns is guaranteed to see the producer's writes. If you copy the frame to the host, queue the `cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream)` call on the same stream — that is what makes the cross-process sync work.
+
+**Frame accessors** — `cuframes_frame_cuda_ptr()` (device pointer, read-only), `cuframes_frame_format()`, `cuframes_frame_pitch_y()` / `_pitch_uv()`, `cuframes_frame_seq()` (monotonic per publisher), `cuframes_frame_pts_ns()` (publisher-side CLOCK_MONOTONIC). After `release()` the handle is invalid — do not call any accessor on it.
+
+**`cuframes_subscriber_release(sub, f)`** — acknowledges the slot back to the publisher. The publisher only waits for this ACK when it runs with `CUFRAMES_POLICY_STRICT_WAIT`; under the default `CUFRAMES_POLICY_DROP_OLDEST` you must still call it to free the consumer-side handle. Passing `NULL` is a no-op.
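+
+In `CUFRAMES_MODE_NEWEST_ONLY` the gap between the `seq` values of two consecutive frames shows how many frames were skipped. The following is a minimal sketch, assuming `seq` grows by exactly 1 per published frame (this page only states that it is monotonic). `frames_skipped` is a hypothetical helper, not a library function:
+
+```c
+#include <stdint.h>
+
+/* Hypothetical helper: number of frames skipped between two frames returned
+ * by consecutive next() calls. Assumes seq grows by 1 per published frame.
+ * Returns 0 when curr_seq does not advance past prev_seq. */
+static uint64_t frames_skipped(uint64_t prev_seq, uint64_t curr_seq)
+{
+    if (curr_seq <= prev_seq)
+        return 0;               /* duplicate, or the publisher restarted */
+    return curr_seq - prev_seq - 1;
+}
+```
+
+In the loop above you could keep the previous value of `cuframes_frame_seq(f)` and log `frames_skipped(prev, curr)` to see how far the subscriber lags behind the publisher.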
+
+## Compile
+
+```bash
+gcc -O2 -I/usr/local/include -I/usr/local/cuda/include \
+    -o first_subscriber first_subscriber.c \
+    -L/usr/local/lib -lcuframes \
+    -L/usr/local/cuda/lib64 -lcudart -lcuda
+```
+
+## Run
+
+In one terminal, start the publisher from the previous page:
+
+```bash
+./first_publisher mykey
+```
+
+In another terminal:
+
+```bash
+./first_subscriber mykey
+```
+
+Expected output is 10 lines with rising `seq` and `mismatch=0`.
+
+## Docker note
+
+The subscriber must share the publisher's **IPC namespace** (so it can `shm_open` the same `/dev/shm/cuframes-mykey` header). It does **not** need to share the PID namespace — this is a v0.4 change. The old v0.1 / v0.2 requirement of `--pid=container:<publisher>` is gone because handles travel as POSIX file descriptors over the Unix socket (`SCM_RIGHTS`), not as CUDA IPC mem-handles.
+
+```bash
+docker run --rm --runtime=nvidia \
+  --ipc=container:cuframes-pub \
+  -v /run/cuframes:/run/cuframes:ro \
+  gx/cuframes:0.4 ./first_subscriber mykey
+```
+
+## Handling disconnects
+
+If the publisher exits or crashes while you are looping, the next `cuframes_subscriber_next()` returns `CUFRAMES_ERR_DISCONNECTED`. The handle is then dead — destroy it and (optionally) reconnect:
+
+```c
+if (r == CUFRAMES_ERR_DISCONNECTED) {
+    cuframes_subscriber_destroy(sub);
+    sub = NULL;
+    /* sleep + retry cuframes_subscriber_create(&cfg, &sub) */
+}
+```
+
+A reconnect pattern, including back-off and `consumer_name` reuse caveats, is covered in [Concepts → Reconnect](/docs/concepts/reconnect).
@@ -0,0 +1,110 @@
+---
+title: Install
+sidebar_position: 1
+---
+
+# Install
+
+cuframes is **Linux only**. The IPC mechanism relies on POSIX shared memory and `SCM_RIGHTS` file-descriptor passing over Unix sockets. Windows, macOS and WSL2 are not supported.
+
+You also need an NVIDIA GPU with compute capability ≥ 7.5 (Turing or newer) and a CUDA 12+ driver. See [Concepts → Requirements](/docs/concepts/requirements) for the full matrix.
+
+## Option 1 — Pre-built Docker image (recommended for trying it out)
+
+The runtime image ships `libcuframes.so` and the `cuframes-rtsp-source` bridge tool on top of `nvidia/cuda:12.4.1-runtime`.
+
+```bash
+docker pull gx/cuframes:0.4
+```
+
+Smoke check:
+
+```bash
+docker run --rm --runtime=nvidia gx/cuframes:0.4 \
+  /usr/local/bin/cuframes-rtsp-source --help
+```
+
+To run a publisher and a subscriber from two containers, the **publisher** container must start with `--ipc=shareable` and the **subscriber** must share its IPC namespace via `--ipc=container:<publisher>`. PID namespace sharing is **not** required since v0.4 — handles are exchanged as POSIX file descriptors over the Unix socket.
+
+```bash
+# Publisher
+docker run -d --name cuframes-pub --runtime=nvidia --ipc=shareable \
+  -v /run/cuframes:/run/cuframes \
+  gx/cuframes:0.4 \
+  /usr/local/bin/cuframes-rtsp-source --rtsp 'rtsp://...' --key cam1
+
+# Subscriber
+docker run --rm --runtime=nvidia \
+  --ipc=container:cuframes-pub \
+  -v /run/cuframes:/run/cuframes:ro \
+  gx/cuframes:0.4 \
+  /usr/local/bin/sub_count --key cam1 --max-frames 100
+```
+
+See [Concepts → Docker IPC](/docs/concepts/docker-ipc) for the underlying namespace rules.
+
+## Option 2 — Build from source
+
+### Build requirements
+
+| | Minimum |
+|---|---|
+| CUDA Toolkit | 12.0 |
+| NVIDIA driver | 525 |
+| CMake | 3.20 |
+| GCC / Clang | 11 / 14 |
+| FFmpeg dev libs | libavcodec, libavformat, libavutil (only for `cuframes-rtsp-source`) |
+
+On Ubuntu 22.04 / 24.04:
+
+```bash
+sudo apt-get install -y \
+  build-essential cmake ninja-build pkg-config \
+  libavcodec-dev libavformat-dev libavutil-dev
+```
+
+### Configure and build
+
+```bash
+git clone https://git.goldix.org/gx/cuframes.git
+cd cuframes
+cmake -B build -S . -G Ninja -DCMAKE_BUILD_TYPE=Release
+cmake --build build --parallel
+```
+
+This produces:
+
+- `build/libcuframes/libcuframes.so` — the shared library
+- `build/tools/cuframes-rtsp-source/cuframes-rtsp-source` — the RTSP bridge
+- `build/examples/sub_count/sub_count` — reference subscriber
+
+### Install system-wide
+
+```bash
+sudo cmake --install build --prefix /usr/local
+sudo ldconfig
+```
+
+Headers land in `/usr/local/include/cuframes/`, the library in `/usr/local/lib/`.
+
+### Build options
+
+| Option | Default | Notes |
+|---|---|---|
+| `BUILD_TOOLS` | `ON` | `cuframes-rtsp-source` (needs FFmpeg dev libs) |
+| `BUILD_EXAMPLES` | `ON` | `sub_count` reference subscriber |
+| `BUILD_TESTING` | `ON` | unit + stress tests |
+| `BUILD_FFMPEG_FILTER` | `OFF` | out-of-tree, requires a patched FFmpeg tree |
+| `BUILD_PYTHON_BINDINGS` | `OFF` | planned |
+
+## Option 3 — apt / dpkg packages
+
+Coming when v1.0 ships. Until then, use the Docker image or build from source.
+
+## Verify the install
+
+```bash
+cuframes-rtsp-source --help
+```
+
+If the binary is on `PATH` and prints its usage banner, the runtime is wired up. To verify that the library itself is loadable from your own code, jump to [First publisher](./first-publisher.md).
@@ -0,0 +1,7 @@
+{
+  "label": "Integration",
+  "position": 4,
+  "collapsible": true,
+  "collapsed": false,
+  "link": null
+}
@@ -0,0 +1,125 @@
+---
+title: FFmpeg cuframes:// demuxer
+sidebar_position: 1
+---
+
+# FFmpeg `cuframes://` demuxer
+
+cuframes ships two FFmpeg input demuxers, both delivered as a patch on top of upstream FFmpeg:
+
+- **`cuframes`** — subscribes to a decoded NV12 frame ring and exposes it as a `rawvideo` stream (one stream per URL).
+- **`cuframes_packets`** — subscribes to an encoded packet ring and exposes it as an `h264` / `hevc` byte-stream, with `extradata` taken from the publisher's handshake.
+
+Both demuxers are pure consumers. They never decode, never re-encode, and never touch the network — the actual RTSP pull happens once in the publisher (typically [`cuframes-rtsp-source`](/docs/getting-started/install)). FFmpeg just attaches to the existing ring via a Unix socket.
+
+## URL scheme
+
+```text
+cuframes://<key>            # decoded NV12 frames (raw GPU surfaces)
+cuframes_packets://<key>    # encoded H264/HEVC packets (Annex-B)
+```
+
+`<key>` is the publisher's key — the same string passed to `cuframes_publisher_create()` or to `--key` on `cuframes-rtsp-source`. The legacy `cuframes:<key>` form (no `//`) is also accepted.
+
+The two ring types are independent. A single publisher can expose both: decoded frames for compositors / AI, encoded packets for recorders that want to skip a re-encode.
+
+## What the demuxer does
+
+On open, the demuxer:
+
+1. Connects to `/run/cuframes/<key>.sock`.
+2. Receives N POSIX file descriptors via `SCM_RIGHTS` (frame slots) plus the shm metadata header.
+3. For `cuframes`: imports each FD as a CUDA VMM allocation, advertises a single `AV_PIX_FMT_NV12` stream at the publisher's width/height/framerate.
+4. For `cuframes_packets`: reads `extradata` (SPS/PPS for H264, VPS/SPS/PPS for HEVC) from the handshake and advertises one `AV_CODEC_ID_H264` or `AV_CODEC_ID_HEVC` stream.
+
+In the read loop the demuxer polls the publisher's `global_seq`, copies the frame / packet into the pipeline, and stamps `pts` from the publisher's clock. See [Protocol reference](/docs/reference/protocol) for the wire format.
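+
+The descriptor hand-off in step 2 relies on the standard `SCM_RIGHTS` ancillary-data mechanism of Unix sockets. The sketch below shows only that generic POSIX pattern. The real cuframes handshake framing is defined in the protocol reference and is not reproduced here, and `send_fds`, `recv_fds`, `fd_ctrl_t` and `MAX_FDS` are hypothetical names:
+
+```c
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/uio.h>
+#include <unistd.h>
+
+#define MAX_FDS 16  /* illustrative cap, not a cuframes constant */
+
+/* Control buffer large enough for MAX_FDS descriptors, aligned for cmsghdr. */
+typedef union {
+    char buf[CMSG_SPACE(sizeof(int) * MAX_FDS)];
+    struct cmsghdr align;
+} fd_ctrl_t;
+
+/* Send n descriptors (1..MAX_FDS) with a 1-byte payload. Returns 0 or -1. */
+static int send_fds(int sock, const int *fds, size_t n)
+{
+    if (n == 0 || n > MAX_FDS)
+        return -1;
+
+    char payload = 'F';
+    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
+    fd_ctrl_t ctrl;
+    memset(&ctrl, 0, sizeof ctrl);
+
+    struct msghdr msg = { 0 };
+    msg.msg_iov = &iov;
+    msg.msg_iovlen = 1;
+    msg.msg_control = ctrl.buf;
+    msg.msg_controllen = CMSG_SPACE(sizeof(int) * n);
+
+    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
+    c->cmsg_level = SOL_SOCKET;
+    c->cmsg_type = SCM_RIGHTS;
+    c->cmsg_len = CMSG_LEN(sizeof(int) * n);
+    memcpy(CMSG_DATA(c), fds, sizeof(int) * n);
+
+    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
+}
+
+/* Receive descriptors into fds[0..max). Descriptors beyond max are closed.
+ * Returns the number stored, or -1 on error. */
+static int recv_fds(int sock, int *fds, size_t max)
+{
+    char payload;
+    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
+    fd_ctrl_t ctrl;
+
+    struct msghdr msg = { 0 };
+    msg.msg_iov = &iov;
+    msg.msg_iovlen = 1;
+    msg.msg_control = ctrl.buf;
+    msg.msg_controllen = sizeof ctrl.buf;
+
+    if (recvmsg(sock, &msg, 0) <= 0)
+        return -1;
+
+    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
+    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
+        return -1;
+
+    size_t n = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);
+    int all[MAX_FDS];
+    memcpy(all, CMSG_DATA(c), n * sizeof(int));
+
+    size_t kept = 0;
+    for (size_t i = 0; i < n; i++) {
+        if (kept < max)
+            fds[kept++] = all[i];
+        else
+            close(all[i]);      /* do not leak descriptors we cannot return */
+    }
+    return (int)kept;
+}
+```
+
+The received integers are new descriptor numbers in the receiving process that refer to the same underlying kernel objects, which is why no shared PID namespace is needed for the transfer.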
+
+## Pipeline examples
+
+### Single source — decoded ring → NVENC → MPEG-TS
+
+Re-encode a published camera into an H.264 MPEG-TS UDP stream. No NVDEC happens in this ffmpeg — the publisher already decoded once, this process just NVENCs the shared NV12 surface.
+
+```bash
+ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
+  -f cuframes -i cuframes://cam-parking \
+  -c:v h264_nvenc -preset p4 -b:v 4M \
+  -f mpegts udp://192.168.88.50:5000
+```
+
+### Packet ring — true copy, no decode and no encode
+
+When you only need to record or restream an existing camera and do not care about the decoded pixels, subscribe to the **packet** ring with `-c:v copy`. Both NVDEC and NVENC stay idle.
+
+```bash
+ffmpeg -f cuframes_packets -i cuframes_packets://cam-parking \
+  -c:v copy -f mp4 /var/recordings/cam-parking.mp4
+```
+
+This is the cheapest way to fan out a single decode to N recorders.
+
+### Composition — 4 inputs into a CUDA grid
+
+A short example of multi-input wiring. The full filter reference lives in [FFmpeg `vf_cuda_grid` filter](/docs/integration/ffmpeg-filter).
+
+```bash
+ffmpeg \
+  -hwaccel cuda -hwaccel_output_format cuda \
+  -f cuframes -i cuframes://cam1 \
+  -f cuframes -i cuframes://cam2 \
+  -f cuframes -i cuframes://cam3 \
+  -f cuframes -i cuframes://cam4 \
+  -filter_complex "[0:v][1:v][2:v][3:v]cuda_grid=layout=quad[out]" \
+  -map "[out]" -c:v h264_nvenc -preset p4 \
+  -f rtsp rtsp://127.0.0.1:8554/grid
+```
+
+## Build / run with the patched FFmpeg
+
+The demuxers live in `libavformat/cuframesdec.c` and `libavformat/cuframes_packetsdec.c` and are not in upstream FFmpeg. You have two options.
+
+### Option A — pre-built Docker image
+
+A production-tested image is published as `ffmpeg-vf-cuda-grid:phase8`. It contains the patched ffmpeg binary, `libcuframes.so`, and the `vf_cuda_grid` filter. The reference `docker compose` setup that wires it together with a publisher container lives in the `localhost-infra` repo — copy and adapt, do not pin it as a public dependency.
+
+```bash
+docker run --rm --runtime=nvidia \
+  --ipc=container:cuframes-publisher \
+  ffmpeg-vf-cuda-grid:phase8 \
+  ffmpeg -f cuframes -i cuframes://cam1 -c:v copy -f null -
+```
+
+The `--ipc=container:...` flag matches the publisher's IPC namespace so the POSIX shm header is visible. PID namespace sharing is **not** required since cuframes v0.4.
+
+### Option B — build it yourself
+
+Use the `ffmpeg-builds` toolchain (a fork of `BtbN/FFmpeg-Builds`). The script `scripts.d/50-libcuframes.sh` clones cuframes and builds it as a static library; `--enable-libcuframes` is appended automatically when the `cuframes` addin is active.
+
+```bash
+git clone <ffmpeg-builds repo> ffmpeg-builds
+cd ffmpeg-builds
+ADDITIONAL_SCRIPTS=50-libcuframes.sh ./build.sh <target> <variant>
+```
+
+The patched ffmpeg source tree (with both demuxers and the filter) lives in `ffmpeg-fresh/`. If you want to vendor the patch into your own FFmpeg fork, copy the three files (`libavformat/cuframesdec.c`, `libavformat/cuframes_packetsdec.c`, `libavfilter/vf_cuda_grid.c`) plus the matching `Makefile` and `allformats.c` / `allfilters.c` registrations.
+
+## Reconnect behaviour
+
+Publishers come and go: containers restart, RTSP cameras reboot, the host's `cuframes-rtsp-source` gets updated. The demuxer is designed to **survive a publisher restart without tearing down the FFmpeg pipeline**.
+
+When the subscriber sees `CUFRAMES_ERR_DISCONNECTED`:
+
+- The demuxer does **not** return `AVERROR_EOF`.
+- It releases the dead subscriber and tries `cuframes_subscriber_create()` again, rate-limited to **one attempt every 2 seconds**.
+- While reconnecting, `av_read_frame()` returns `AVERROR(EAGAIN)`. The pipeline blocks but stays alive.
+- On success the demuxer logs `cuframes: reconnected to '<key>'` at `INFO` level and resumes delivering frames.
+
+This matters for long-running consumers (NVR recorders, RTSP restreamers, NVENC composers) that would otherwise need an external supervisor to restart ffmpeg on every publisher hiccup.
+
+If you actually want EOF on disconnect — e.g. a one-shot transcode that should stop when the source dies — wrap the demuxer with `-timeout` or your own watchdog. The built-in behaviour is "wait forever", not "fail fast".
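+
+The "one attempt every 2 seconds" rule can be pictured as a small time gate. This is a sketch of the idea only, not the demuxer's actual code. `reconnect_gate_t` and `reconnect_gate_allow` are hypothetical names, and the caller is assumed to pass in a monotonic timestamp in milliseconds:
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+#define RECONNECT_INTERVAL_MS 2000  /* from the text: one attempt every 2 s */
+
+/* Hypothetical per-demuxer state. */
+typedef struct {
+    int64_t last_attempt_ms;   /* monotonic time of the last attempt */
+    bool    attempted;         /* false until the first attempt */
+} reconnect_gate_t;
+
+/* Returns true when a new attempt is allowed at now_ms and records it.
+ * Returns false when the previous attempt was less than 2 s ago. */
+static bool reconnect_gate_allow(reconnect_gate_t *g, int64_t now_ms)
+{
+    if (g->attempted && now_ms - g->last_attempt_ms < RECONNECT_INTERVAL_MS)
+        return false;          /* too soon: the caller reports EAGAIN */
+    g->attempted = true;
+    g->last_attempt_ms = now_ms;
+    return true;
+}
+```
+
+A watchdog that wants fail-fast behaviour could count how many times the gate has opened without a successful reconnect and stop the process after a limit of its own choosing.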
+
+## See also
+
+- [First publisher](/docs/getting-started/first-publisher) — minimal C producer.
+- [Protocol reference](/docs/reference/protocol) — wire format and handshake.
+- [`vf_cuda_grid` filter](/docs/integration/ffmpeg-filter) — multi-camera composition.
@@ -0,0 +1,103 @@
+---
+title: FFmpeg vf_cuda_grid filter (CCTV grid composition)
+sidebar_position: 2
+---
+
+# FFmpeg `vf_cuda_grid` filter
+
+`vf_cuda_grid` is a CUDA-accelerated multi-input grid compositor for CCTV-style live walls — N camera streams in, one composed frame out, all on the GPU.
+
+:::note Not part of cuframes
+`vf_cuda_grid` is a **separate, out-of-tree FFmpeg filter** maintained as its own project ([git.goldix.org/gx/vf-cuda-grid](https://git.goldix.org/gx/vf-cuda-grid)). It happens to ship in the same patched FFmpeg image as the cuframes demuxers, and it is the canonical consumer that shows what cuframes is actually *for*. But it is not part of the cuframes library and follows its own release cycle.
+
+This page is a quick pointer so the integration story is end-to-end readable. For full options, layout templates, ZMQ command protocol and overlay rendering details, see the upstream project.
+:::
+
+## Why it exists
+
+A typical CCTV TV-wall pipeline needs:
+
+- Decode 4–16 cameras once (the cuframes publishers do this with NVDEC).
+- Compose them into a single grid layout (single / quad / main + previews / …).
+- Re-encode the composed frame with NVENC and push it as RTSP / SRT to displays.
+
+Doing this with standard FFmpeg filters means CPU round-trips (`hwdownload` / `hwupload`), which kills frame rate at 4K. `vf_cuda_grid` stays in CUDA memory end-to-end — the input surfaces come straight from the cuframes ring, are placed into a single output CUDA frame via `cuMemcpy2DAsync`, and handed to `h264_nvenc` without ever leaving the GPU.
+
+## Quick example — 4 cameras into a 2×2 quad
+
+```bash
+ffmpeg \
+  -hwaccel cuda -hwaccel_output_format cuda \
+  -f cuframes -i cuframes://cam1 \
+  -f cuframes -i cuframes://cam2 \
+  -f cuframes -i cuframes://cam3 \
+  -f cuframes -i cuframes://cam4 \
+  -filter_complex "[0:v][1:v][2:v][3:v]cuda_grid=layout=quad[out]" \
+  -map "[out]" \
+  -c:v h264_nvenc -preset p4 -tune ll -b:v 8M \
+  -f rtsp rtsp://127.0.0.1:8554/wall
+```
+
+All four inputs are decoded NV12 surfaces shared from publisher processes. The filter never touches CPU memory, and the encoder receives a CUDA frame.
+
+## Layout templates
+
+The filter currently ships these templates (defined in `libavfilter/vf_cuda_grid.c`):
+
+- `single` — one input, full canvas. Useful for runtime layout switching (start with quad, switch to fullscreen on a single camera).
+- `dual_h`, `dual_v` — two inputs, horizontal / vertical split.
+- `quad` — four inputs, 2×2.
+- `main_plus_preview` — one large cell plus a row of three smaller ones.
+
+Select via `layout=<name>`. Output canvas dimensions default to the first input's resolution and are configurable via filter options.
+
+## Runtime control via ZMQ
+
+The filter exposes two commands through FFmpeg's `process_command` mechanism (which the `zmq` filter forwards):
+
+- `set_layout <name>` — swap the active layout template without restarting the pipeline. Bounded by `max_cells` (compile-time max so the input pads do not need to be re-added).
+- `add_overlay <args>` — draw a colored rectangle / text / image overlay inside a specific cell (e.g. a red border on a camera with motion, a timestamp on cell 0).
+
+The argument format and full overlay semantics live in the upstream project's docs.
+
+:::tip ZMQ argument quoting
+When passing ZMQ commands from a shell, the command + args **must** be a single quoted argument — otherwise FFmpeg's ZMQ filter parses only the command and silently drops the rest. This has bitten enough people to mention here:
+
+```bash
+# correct
+echo "Parsed_cuda_grid_0 set_layout main_plus_preview" | zmqsend
+
+# wrong — args silently dropped
+echo Parsed_cuda_grid_0 set_layout main_plus_preview | zmqsend
+```
+:::
+
+## Production pipeline
+
+A real CCTV-wall pipeline (multiple publishers, the filter, NVENC, an RTSP server, plus the ZMQ controller for layout switching) is wired together in the `localhost-infra` repo as `docker-compose.phase7.yml` (or whatever is current — check the repo). It is the most up-to-date reference for what flags actually work together and which images to pin.
+
+**Do not copy a compose snippet from this page** — by the time you read it the image tags will have drifted. Open the live file in `localhost-infra` and adapt from there.
+
+The general shape is:
+
+```text
+[ cuframes-rtsp-source × N ]   (publishers, one per camera)
+            │
+            │  cuframes://camN  (decoded NV12, shared via SCM_RIGHTS)
+            ▼
+[ ffmpeg + vf_cuda_grid ]      (composer, GPU-side)
+            │
+            │  H264 NVENC
+            ▼
+[ mediamtx ]                   (RTSP / WebRTC fan-out)
+            │
+            ▼
+[ TV / browser ]
+```
+
+A separate small controller process talks to the composer over ZMQ to switch layouts on user input.
+
+## See also
+
+- [`cuframes://` demuxer](/docs/integration/ffmpeg-demuxer) — how the inputs get into FFmpeg.
+- Upstream filter project — [git.goldix.org/gx/vf-cuda-grid](https://git.goldix.org/gx/vf-cuda-grid) — full reference, build instructions, ZMQ protocol, overlay types.
@@ -0,0 +1,47 @@
+---
+title: Python bindings
+sidebar_position: 3
+---
+
+# Python bindings
+
+**Status: planned for v0.5+. Not shipped yet.**
+
+This page is a placeholder so the integration story is honest about what exists and what does not. If you need Python access to cuframes today, read [Workaround for v0.4](#workaround-for-v04) below.
+
+## What v0.5+ will ship
+
+A small `cuframes` Python package, distributed as a wheel, providing:
+
+- **`ctypes`-based bindings** around the existing C API (`cuframes_publisher_create`, `cuframes_subscriber_create`, `cuframes_acquire`, `cuframes_release`, etc.). No new ABI, no SWIG, no C++ wrapper.
+- **Zero-copy CuPy / PyTorch access** to the decoded NV12 surface via the [CUDA Array Interface](https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html). A subscribed frame is exposed as a CuPy / PyTorch tensor that points at the same GPU memory the publisher wrote — no `cudaMemcpy` to host, no copy on the device either.
+- **Context-manager API** for acquire / release so frame slots cannot leak across exceptions.
+
+Target use cases:
+
+- **PyTorch inference on shared frames** — a detector / classifier subscribes to `cuframes://camN`, gets a tensor, runs `model(tensor)` directly. Today this requires either re-decoding the RTSP stream or copying frames over a Unix pipe.
+- **OpenCV CUDA processing** — `cv2.cuda_GpuMat` constructed from the cuframes pointer, then any cv2.cuda op (resize, color convert, optical flow) runs in place.
+- **Quick prototyping** — a Jupyter notebook that subscribes to a live camera, slices out a region of interest, and visualises it without spinning up a full FFmpeg pipeline.
+
+There is no plan to add CPU-side NumPy fallback. Frames are GPU surfaces; if you want them on the CPU you do an explicit `tensor.cpu()` and accept the copy.
+
+## Workaround for v0.4
+
+Until the bindings ship, the supported path is to call `libcuframes.so` directly from Python using `ctypes` or `cffi`. The C API is small (≈ 10 functions) and stable within a v0.x release.
+
+Realistically though, most v0.4 deployments do **not** call cuframes from Python at all. They use the existing C/FFmpeg path:
+
+- For inference: subscribe to the **packet** ring with `ffmpeg -f cuframes_packets -i cuframes_packets://camN` and pipe decoded frames into your Python process. You lose zero-copy but gain a working pipeline today.
+- For prototyping: use the C examples in [`examples/`](https://git.goldix.org/gx/cuframes) as your starting point.
+
+If you need a Python-side zero-copy path *right now*, you are going to write the `ctypes` wrappers yourself. Mirror the prototypes from [`include/cuframes/cuframes.h`](/docs/reference/protocol), keep handles opaque (`void *`), and wrap the device pointer returned by `cuframes_frame_cuda_ptr()` in CuPy `UnownedMemory` to view the imported VMM allocation. Expect rough edges — the v0.5 bindings exist precisely because this is not fun to do by hand.
+
+## Track progress
+
+Roadmap and milestone notes for the Python package live in [`ROADMAP.md`](https://git.goldix.org/gx/cuframes) in the cuframes repo. v0.5 also gates on the HEVC packet-ring path and a small ABI cleanup, so the bindings are not the only thing on that release.
+
+## See also
+
+- [Install](/docs/getting-started/install) — get `libcuframes.so` onto your system so the bindings will have something to load.
+- [Protocol reference](/docs/reference/protocol) — the C API surface the bindings will mirror.
+- [FFmpeg `cuframes://` demuxer](/docs/integration/ffmpeg-demuxer) — today's practical path for getting cuframes data into any non-C consumer.
@@ -0,0 +1,59 @@
+---
+title: What is cuframes
+sidebar_position: 1
+slug: /intro
+---
+
+# cuframes
+
+**Zero-copy decoded video frames over CUDA, shared across processes — without pid namespace sharing.**
+
+Pure C library (LGPL-2.1+) for passing decoded NV12 frames between Linux processes via CUDA VMM and POSIX file descriptors. No re-encode, no CPU-side memcpy, no Unix-pipe serialization: the consumer gets a pointer to the same GPU memory as the producer.
+
+```mermaid
+flowchart LR
+  RTSP[RTSP camera] --> Pub[Publisher<br/>NVDEC → VMM pool]
+  Pub -- POSIX FD via SCM_RIGHTS --> Sub1[Subscriber 1<br/>FFmpeg detect]
+  Pub -- POSIX FD via SCM_RIGHTS --> Sub2[Subscriber 2<br/>FFmpeg compose]
+  Pub -- POSIX FD via SCM_RIGHTS --> Sub3[Subscriber 3<br/>AI inference]
+```
+
+## Status
+
+**v0.4 — early but production-tested.** A single deployment in a real CCTV system (4 IP cameras, NVENC re-encode chain, 25 fps). Not enterprise-ready: one maintainer, no paid support, and the ABI broke 4 times during one month of development.
+
+A good fit if you are building your own video pipeline and understand the risks of an early-stage OSS library. Not a drop-in replacement for DeepStream / GStreamer in enterprise scenarios.
+
+## Why
+
+The typical case is one decoder per RTSP camera and several consumers (NVR recording, an AI detector, a live compositor for a TV wall). Without cuframes:
+
+- **Naive path:** every consumer opens its own RTSP stream → N×NVDEC + N×NIC. With 10 cameras and 3 consumers that is 30 parallel decodes.
+- **DeepStream:** works, but brings vendor lock-in, a heavy runtime, and licensing restrictions.
+- **Raw CUDA IPC handles:** require a shared PID namespace between processes. With Frigate or a Docker stack on k8s that is often not possible.
+
+cuframes solves the third problem: one decoder publishes, multiple consumers (up to 32 per publisher) subscribe over a Unix socket, FDs are passed via `SCM_RIGHTS`, and no PID sharing is needed.
+
+## What cuframes does not do
+
+- **Encoded video.** The frame ring carries decoded NV12. For an encoded H.264/H.265 packet stream there is a separate packet ring (Annex-B byte stream over POSIX shm).
+- **Cross-host.** Same-machine only: a POSIX FD cannot be passed over the network.
+- **Cross-vendor.** NVIDIA only: `cuMemCreate` is a CUDA API. AMD HIP / Intel oneAPI are not supported yet.
+- **Windows.** Linux only (POSIX shm + Unix sockets + SCM_RIGHTS).
+- **HA / failover.** One producer per key. If it dies, consumers see `CUFRAMES_ERR_DISCONNECTED` and decide for themselves what to do.
+
+## Architecture in 30 seconds
+
+- **Producer** (`cuframes_publisher_create`): allocates an N×frame VMM pool (`cuMemCreate(POSIX_FILE_DESCRIPTOR)`), opens the Unix socket `/run/cuframes/<key>.sock`, and opens the POSIX shm `/dev/shm/cuframes-<key>` for the metadata header.
+- **Subscriber** (`cuframes_subscriber_create`): connects to the socket, receives N file descriptors via `SCM_RIGHTS` during the handshake, imports them with `cuMemImportFromShareableHandle`, and mmaps the shm header.
+- **Publish loop:** the producer calls `acquire()`, copies into the slot, then calls `publish(stream, pts)`, which does `cuStreamSynchronize` followed by an atomic update of `slot.seq` and `global_seq`.
+- **Consume loop:** the subscriber polls the atomic `global_seq` and reads the frame with a DtoD copy on its own stream.
+- **Sync:** the producer's `cuStreamSynchronize` before publish guarantees hardware coherence, so the consumer reads the data after an atomic load, without CUDA events (a simplification introduced in `v0.4`).
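+
+The publish and consume loops above come down to a release-store and an acquire-load on a shared sequence counter. The sketch below illustrates that pattern with C11 atomics. The `ring_header_t` layout, `RING_SLOTS`, and the `seq % RING_SLOTS` slot mapping are assumptions made for illustration and are not the real shm header format:
+
+```c
+#include <stdatomic.h>
+#include <stdint.h>
+
+#define RING_SLOTS 4   /* illustrative; the real ring size comes from the publisher */
+
+/* Hypothetical header layout, NOT the real cuframes shm format. */
+typedef struct {
+    _Atomic uint64_t global_seq;             /* 0 means nothing published yet */
+    _Atomic uint64_t slot_seq[RING_SLOTS];   /* seq currently held by each slot */
+} ring_header_t;
+
+/* Publisher side: call only after the GPU write to the slot has completed
+ * (the text above uses cuStreamSynchronize for that). Returns the new seq. */
+static uint64_t ring_publish(ring_header_t *h)
+{
+    uint64_t seq = atomic_load_explicit(&h->global_seq, memory_order_relaxed) + 1;
+    atomic_store_explicit(&h->slot_seq[seq % RING_SLOTS], seq, memory_order_release);
+    atomic_store_explicit(&h->global_seq, seq, memory_order_release);
+    return seq;
+}
+
+/* Subscriber side, newest-only: returns the slot index of a frame that has
+ * not been seen yet, or -1 when there is nothing new. Updates *last_seen. */
+static int ring_poll_newest(ring_header_t *h, uint64_t *last_seen)
+{
+    uint64_t seq = atomic_load_explicit(&h->global_seq, memory_order_acquire);
+    if (seq == 0 || seq == *last_seen)
+        return -1;
+    *last_seen = seq;
+    return (int)(seq % RING_SLOTS);
+}
+```
+
+The sketch has a single writer, so the relaxed load in `ring_publish` is enough; the release and acquire pair is what orders the slot update before the consumer's read.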
+
+The full specification is in the [Protocol reference](/docs/reference/protocol).
+
+## Next
+
+- [Install](/docs/getting-started/install) — Docker image or build from source (apt packages are planned).
+- [First publisher](/docs/getting-started/first-publisher) — 30-line C example.
+- [FFmpeg demuxer integration](/docs/integration/ffmpeg-demuxer) — `cuframes://key` URL scheme.
@@ -0,0 +1,7 @@
+{
+  "label": "Reference",
+  "position": 5,
+  "collapsible": true,
+  "collapsed": true,
+  "link": null
+}
@@ -0,0 +1,531 @@
+---
+sidebar_position: 1
+title: C API
+---
+
+# C API reference
+
+A full listing of the public C API from `<cuframes/cuframes.h>` (libcuframes 0.4.0). The header in the repo is the source of truth; this page mirrors it in Docusaurus format with cross-links to the concept sections.
+
+## Headers & linkage
+
+```c
+#include <cuframes/cuframes.h>
+```
+
+```bash
+# pkg-config (if installed from the .deb)
+cc app.c $(pkg-config --cflags --libs cuframes)
+
+# manually
+cc app.c -lcuframes
+```
+
+`libcuframes.so.0` links dynamically against `libcuda.so.1` (the CUDA driver API, not the runtime). Most user code also needs `-lcudart` to manage its own CUDA streams.
+
+## Conventions
+
+- Unless noted otherwise, functions return `int`: `0` (`CUFRAMES_OK`) on success, a negative code from [`cuframes_error_t`](#error-codes) on failure. Decode the code with [`cuframes_strerror`](#error-decoding).
+- All handle types (`cuframes_publisher_t`, `cuframes_subscriber_t`, `cuframes_frame_t`, `cuframes_packet_t`) are **opaque**. Fields are not accessible directly, only through accessor functions. This keeps the ABI stable across minor releases.
+- Each handle belongs to **a single thread**. Cross-thread access is undefined behavior. Multiple handles in different threads are fine.
+- Endianness is little-endian (which CUDA platforms fix anyway).
+
+## Version & error codes
+
+### Library version
+
+```c
+const char *cuframes_version_string(void);
+uint32_t    cuframes_protocol_version(void);
+```
+
+`cuframes_version_string` returns the runtime version of libcuframes in `"MAJOR.MINOR.PATCH"` format (for example `"0.4.0"`). Compile-time constants:
+
+```c
+#define CUFRAMES_VERSION_MAJOR 0
+#define CUFRAMES_VERSION_MINOR 4
+#define CUFRAMES_VERSION_PATCH 0
+```
+
+`cuframes_protocol_version` returns the wire-protocol version (`4` for v0.4). Subscribers with a different protocol version cannot connect: the publisher replies with `HELLO_RESP(result=CUFRAMES_ERR_PROTOCOL)`. See [Protocol reference](/docs/reference/protocol).
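+
+An application can compare the runtime version string with the constants it was compiled against. The sketch below is one possible check. `parse_version` and `version_compatible` are hypothetical helpers, and the policy of requiring the same MAJOR and MINOR is an assumption for a 0.x library, not a documented guarantee:
+
+```c
+#include <stdio.h>
+
+/* Parse "MAJOR.MINOR.PATCH". Returns 0 on success, -1 on malformed input
+ * (missing fields, negative numbers, or trailing characters). */
+static int parse_version(const char *s, int *major, int *minor, int *patch)
+{
+    char trailing;
+    if (!s || sscanf(s, "%d.%d.%d%c", major, minor, patch, &trailing) != 3)
+        return -1;
+    return (*major < 0 || *minor < 0 || *patch < 0) ? -1 : 0;
+}
+
+/* Assumed policy: the runtime must match the build-time MAJOR and MINOR.
+ * Returns 1 when compatible, 0 otherwise. */
+static int version_compatible(const char *runtime, int built_major, int built_minor)
+{
+    int major, minor, patch;
+    if (parse_version(runtime, &major, &minor, &patch) != 0)
+        return 0;
+    return major == built_major && minor == built_minor;
+}
+```
+
+With the real header this would be called as `version_compatible(cuframes_version_string(), CUFRAMES_VERSION_MAJOR, CUFRAMES_VERSION_MINOR)` at startup.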
+
+### Error codes
+
+```c
+typedef enum cuframes_error {
+    CUFRAMES_OK                     =    0,
+    CUFRAMES_ERR_INVALID_ARG        =   -1,
+    CUFRAMES_ERR_OUT_OF_MEMORY      =   -2,
+    CUFRAMES_ERR_CUDA               =   -3,
+    CUFRAMES_ERR_IO                 =   -4,
+    CUFRAMES_ERR_NOT_FOUND          =   -5,
+    CUFRAMES_ERR_ALREADY_EXISTS     =   -6,
+    CUFRAMES_ERR_TIMEOUT            =   -7,
+    CUFRAMES_ERR_PROTOCOL           =   -8,
+    CUFRAMES_ERR_DISCONNECTED       =   -9,
+    CUFRAMES_ERR_FORMAT             =  -10,
+    CUFRAMES_ERR_WOULD_BLOCK        =  -11,
+    CUFRAMES_ERR_TOO_MANY           =  -12,
+    CUFRAMES_ERR_PACKET_OVERSIZED   =  -20,
+    CUFRAMES_ERR_NO_PACKET_RING     =  -21,
+    CUFRAMES_ERR_NO_CODEC_PARAMS    =  -22,
+    CUFRAMES_ERR_PACKET_OVERRUN     =  -23,
+    CUFRAMES_ERR_INTERNAL           = -100,
+} cuframes_error_t;
+```
+
+| Code | Name | Meaning |
+|---|---|---|
+| `0` | `CUFRAMES_OK` | Success |
+| `-1` | `CUFRAMES_ERR_INVALID_ARG` | NULL pointer or an invalid value in the config |
+| `-2` | `CUFRAMES_ERR_OUT_OF_MEMORY` | malloc / cudaMalloc failure |
+| `-3` | `CUFRAMES_ERR_CUDA` | CUDA runtime / driver error |
+| `-4` | `CUFRAMES_ERR_IO` | socket / mmap / eventfd failure |
+| `-5` | `CUFRAMES_ERR_NOT_FOUND` | No publisher with this key was found |
+| `-6` | `CUFRAMES_ERR_ALREADY_EXISTS` | A publisher with this key already exists, or the `consumer_name` is taken |
+| `-7` | `CUFRAMES_ERR_TIMEOUT` | The operation did not complete within the timeout |
+| `-8` | `CUFRAMES_ERR_PROTOCOL` | Incompatible wire-protocol version |
+| `-9` | `CUFRAMES_ERR_DISCONNECTED` | Publisher died or the connection dropped |
+| `-10` | `CUFRAMES_ERR_FORMAT` | Unsupported pixel format or size mismatch |
+| `-11` | `CUFRAMES_ERR_WOULD_BLOCK` | Non-blocking call, no data yet |
+| `-12` | `CUFRAMES_ERR_TOO_MANY` | `MAX_SUBSCRIBERS` (32) exceeded |
+| `-20` | `CUFRAMES_ERR_PACKET_OVERSIZED` | `publish_packet` size > `max_packet_size` |
+| `-21` | `CUFRAMES_ERR_NO_PACKET_RING` | Subscriber requested packets but the publisher has no packet ring |
+| `-22` | `CUFRAMES_ERR_NO_CODEC_PARAMS` | Extradata not yet set by the publisher |
+| `-23` | `CUFRAMES_ERR_PACKET_OVERRUN` | Slow subscriber, packet seq overrun; resync on a keyframe |
+| `-100` | `CUFRAMES_ERR_INTERNAL` | Library bug; reproduce and report it |
+
+### Error decoding
+
+```c
+const char *cuframes_strerror(int err);
+```
+
+Returns a human-readable string for an error code. The pointer refers to static storage, so the caller must not free it. Never returns NULL: for an unknown code it returns `"unknown error"`.
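+
+One way to use the codes above is to map each of them to a recovery action. The sketch below is an illustration only: the `err_action_t` enum, the `classify_error` name, and the policy itself are assumptions, and the constants are copied from the enum above so that the block compiles without the cuframes header:
+
+```c
+/* Subset of cuframes_error_t values, copied from the listing above. */
+enum {
+    ERR_OK             =   0,
+    ERR_NOT_FOUND      =  -5,
+    ERR_TIMEOUT        =  -7,
+    ERR_DISCONNECTED   =  -9,
+    ERR_WOULD_BLOCK    = -11,
+    ERR_PACKET_OVERRUN = -23
+};
+
+/* Hypothetical recovery actions, not part of the library. */
+typedef enum {
+    ACTION_NONE,       /* success, carry on */
+    ACTION_RETRY,      /* call again on the same handle */
+    ACTION_RECONNECT,  /* destroy the handle and create a new one */
+    ACTION_FATAL       /* log the error and give up */
+} err_action_t;
+
+/* Assumed policy: transient codes retry, a missing or dead publisher
+ * triggers a reconnect, everything else is treated as fatal. */
+static err_action_t classify_error(int err)
+{
+    switch (err) {
+    case ERR_OK:
+        return ACTION_NONE;
+    case ERR_TIMEOUT:
+    case ERR_WOULD_BLOCK:
+    case ERR_PACKET_OVERRUN:
+        return ACTION_RETRY;
+    case ERR_NOT_FOUND:
+    case ERR_DISCONNECTED:
+        return ACTION_RECONNECT;
+    default:
+        return ACTION_FATAL;
+    }
+}
+```
+
+Whichever policy you choose, logging `cuframes_strerror(err)` next to the chosen action keeps the decision visible in production logs.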
+
+## Pixel formats
+
+```c
+typedef enum cuframes_format {
+    CUFRAMES_FORMAT_NV12      = 0,
+    CUFRAMES_FORMAT_YUV420P   = 1,
+    CUFRAMES_FORMAT_RGB       = 2,
+    CUFRAMES_FORMAT_BGR       = 3,
+    CUFRAMES_FORMAT_RGBA      = 4,
+    CUFRAMES_FORMAT_GRAYSCALE = 5,
+} cuframes_format_t;
+```
+
+| Format | Layout | When to use |
+|---|---|---|
+| `NV12` | Y plane + interleaved UV plane | NVDEC native, default for video pipelines |
+| `YUV420P` | Y + U + V separate planes | FFmpeg `yuv420p` |
+| `RGB` | 24bpp packed RGB | ML inference, OpenGL |
+| `BGR` | 24bpp packed BGR | OpenCV native |
+| `RGBA` | 32bpp packed RGBA | overlays, compositing |
+| `GRAYSCALE` | 8bpp single plane | depth maps, masks |
+
+The format is fixed for a publisher at create time. Changing it requires destroy + recreate with a new key.
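+
+The layouts in the table translate into a buffer size once pitch and height are known. The sketch below assumes that planes are stored back to back, that chroma is subsampled 2×2 for `NV12` and `YUV420P`, and that for `YUV420P` one pitch value applies to both chroma planes. None of this is stated by the header, and `fmt_t` and `frame_bytes` are hypothetical names:
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Local mirror of cuframes_format_t so the sketch compiles standalone. */
+typedef enum {
+    FMT_NV12 = 0, FMT_YUV420P = 1, FMT_RGB = 2,
+    FMT_BGR = 3, FMT_RGBA = 4, FMT_GRAYSCALE = 5
+} fmt_t;
+
+/* Bytes occupied by one frame under the assumptions above.
+ * Returns 0 for invalid input. */
+static size_t frame_bytes(fmt_t fmt, int32_t height, int32_t pitch_y, int32_t pitch_uv)
+{
+    if (height <= 0 || pitch_y <= 0)
+        return 0;
+
+    size_t first_plane = (size_t)pitch_y * (size_t)height;
+    size_t chroma_rows = ((size_t)height + 1) / 2;   /* round up for odd heights */
+
+    switch (fmt) {
+    case FMT_NV12:      /* Y plane + one interleaved UV plane */
+        return pitch_uv > 0 ? first_plane + (size_t)pitch_uv * chroma_rows : 0;
+    case FMT_YUV420P:   /* Y plane + separate U and V planes */
+        return pitch_uv > 0 ? first_plane + 2 * (size_t)pitch_uv * chroma_rows : 0;
+    case FMT_RGB:
+    case FMT_BGR:
+    case FMT_RGBA:
+    case FMT_GRAYSCALE: /* single packed plane, pitch already covers bpp */
+        return first_plane;
+    }
+    return 0;
+}
+```
+
+A subscriber that copies a whole frame to the host could use a helper like this to size its staging buffer from the pitch and size accessors.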
+
+## Policy & mode enums
+
+### Publisher policy
+
+```c
+typedef enum cuframes_publisher_policy {
+    CUFRAMES_POLICY_DROP_OLDEST = 0,
+    CUFRAMES_POLICY_STRICT_WAIT = 1,
+} cuframes_publisher_policy_t;
+```
+
+- `DROP_OLDEST`: the publisher does not wait and overwrites the next slot. A slow consumer skips frames. **Default for real-time.**
+- `STRICT_WAIT`: the publisher blocks until all subscribed consumers have ACKed. No frames are lost, but the slowest consumer slows everyone down. Intended for recording or critical analytics.
+
+### Subscriber mode
+
+```c
+typedef enum cuframes_subscriber_mode {
+    CUFRAMES_MODE_NEWEST_ONLY  = 0,
+    CUFRAMES_MODE_STRICT_ORDER = 1,
+} cuframes_subscriber_mode_t;
+```
+
+- `NEWEST_ONLY`: take the most recent frame and skip the intermediate ones. **Default.**
+- `STRICT_ORDER`: every frame, in order. On ring overflow the call returns `CUFRAMES_ERR_DISCONNECTED` and you need to reconnect.
+
+### Ownership mode
+
+```c
+typedef enum cuframes_ownership_mode {
+    CUFRAMES_OWNERSHIP_LIBRARY  = 0,
+    CUFRAMES_OWNERSHIP_EXTERNAL = 1,
+} cuframes_ownership_mode_t;
+```
+
+- `LIBRARY`: the library owns the VMM pool (see [Sync model](/docs/concepts/sync-vmm-stream)). The publisher calls `acquire()`, writes, then calls `publish()`. **The only supported mode in v0.4.**
+- `EXTERNAL`: **deprecated in v0.4** and no longer functional. `cuframes_publisher_create_external` returns `CUFRAMES_ERR_INVALID_ARG`. For FFmpeg filter integration, use `LIBRARY` plus a single D2D copy into the `acquire()`d slot (this is how cuframes-rtsp-source works as of v0.4).
+
+## Frame accessors
+
+`cuframes_frame_t` is an opaque handle to a frame obtained from a subscriber. It is valid from `cuframes_subscriber_next` until `cuframes_subscriber_release`.
+
+```c
+typedef struct cuframes_frame cuframes_frame_t;
+
+void *cuframes_frame_cuda_ptr(const cuframes_frame_t *frame);
+cuframes_format_t cuframes_frame_format(const cuframes_frame_t *frame);
+void cuframes_frame_size(const cuframes_frame_t *frame,
+                         int32_t *width_out, int32_t *height_out);
+int32_t cuframes_frame_pitch_y(const cuframes_frame_t *frame);
+int32_t cuframes_frame_pitch_uv(const cuframes_frame_t *frame);
+uint64_t cuframes_frame_seq(const cuframes_frame_t *frame);
+int64_t cuframes_frame_pts_ns(const cuframes_frame_t *frame);
+```
+
+| Function | Returns |
+|---|---|
+| `cuda_ptr` | CUDA device pointer to the frame data (read-only for the consumer) |
+| `format` | `cuframes_format_t` |
+| `size` | Width and height in pixels, via out-parameters |
+| `pitch_y` | Pitch (bytes per row) of the Y plane or of the single plane |
+| `pitch_uv` | Pitch of the UV plane (NV12 / YUV420P); `0` for formats without UV |
+| `seq` | Sequence number, monotonically increasing on the publisher side |
+| `pts_ns` | Publisher timestamp (nanoseconds, `CLOCK_MONOTONIC`) |
+
+PTS epoch caveat: the publisher and the consumer may see different `CLOCK_MONOTONIC` epochs (the counter resets after a publisher restart). The consumer should sanity-check timestamps, for example by detecting an epoch change when `pts_ns_curr < pts_ns_prev`.
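+
+The backwards-timestamp check can be kept in a small stateful helper on the consumer side. This is a minimal sketch, not part of libcuframes: `pts_tracker_t` and `pts_tracker_update` are hypothetical names, and the only rule applied is the one stated above (`pts_ns_curr < pts_ns_prev` signals a new epoch).
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Hypothetical consumer-side helper, not part of libcuframes. */
+typedef struct pts_tracker {
+    int64_t prev_pts_ns;  /* last timestamp seen */
+    bool    has_prev;     /* false until the first frame arrives */
+} pts_tracker_t;
+
+/* Feed every frame's pts_ns. Returns true when the timestamp moved
+ * backwards, which is treated as a publisher epoch change. */
+static bool pts_tracker_update(pts_tracker_t *t, int64_t pts_ns)
+{
+    bool epoch_changed = t->has_prev && pts_ns < t->prev_pts_ns;
+    t->prev_pts_ns = pts_ns;
+    t->has_prev = true;
+    return epoch_changed;
+}
+```
+
+A zero-initialized tracker (`pts_tracker_t t = {0};`) is ready to use; on an epoch change the consumer would typically reset any latency statistics derived from `pts_ns`.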
+
+## Publisher API
+
+### Config struct
+
+```c
+typedef struct cuframes_publisher_config {
+    const char *key;                  /* unique name ("cam1"). Must not be NULL. */
+    int32_t width;
+    int32_t height;
+    cuframes_format_t format;
+    cuframes_ownership_mode_t ownership;
+    int32_t ring_size;                /* 2..16, 4 recommended */
+    cuframes_publisher_policy_t policy;
+    int32_t consumer_ack_timeout_ms;  /* STRICT_WAIT; 0 = wait forever */
+    int32_t cuda_device;
+    uint64_t _reserved[4];            /* must be 0 */
+} cuframes_publisher_config_t;
+```
+
+| Field | Constraints |
+|---|---|
+| `key` | ASCII `[a-zA-Z0-9_-]{1,63}`. Must not be NULL. |
+| `width`, `height` | Pixels. Fixed after create. |
+| `format` | See [Pixel formats](#pixel-formats). Fixed after create. |
+| `ownership` | `LIBRARY` only in v0.4. |
+| `ring_size` | 2..16 for `LIBRARY`. Smaller values raise the chance of overrun; larger values use more VRAM. |
+| `policy` | See [Policy](#publisher-policy). |
+| `consumer_ack_timeout_ms` | `STRICT_WAIT` only. `0` = wait forever. |
+| `cuda_device` | Usually `0`. Must match the consumer's. |
+| `_reserved` | Reserved for ABI stability; must be zeroed. |
+
+### Create / destroy
+
+```c
+int cuframes_publisher_create(const cuframes_publisher_config_t *cfg,
+                              cuframes_publisher_t **out);
+
+int cuframes_publisher_create_external(const cuframes_publisher_config_t *cfg,
+                                       void *const *cuda_ptrs,
+                                       int32_t ptr_count,
+                                       size_t frame_size,
+                                       cuframes_publisher_t **out);
+
+int cuframes_publisher_destroy(cuframes_publisher_t *pub);
+```
+
+`cuframes_publisher_create` allocates `ring_size` × `frame_size` via `cuMemCreate(POSIX_FILE_DESCRIPTOR)`, opens the Unix socket `/run/cuframes/<key>.sock`, and mmaps `/dev/shm/cuframes-<key>`. See [Synchronization & VMM stream](/docs/concepts/sync-vmm-stream).
+
+Errors:
+
+| Code | When |
+|---|---|
+| `INVALID_ARG` | `cfg` is NULL, `ring_size` is out of range, or `key` fails the regex |
+| `ALREADY_EXISTS` | A publisher with this key already exists and its process is alive |
+| `CUDA` | `cuMemCreate` failed (out of VRAM, unsupported driver) |
+| `IO` | `bind()` on the socket or `shm_open()` failed |
+
+`cuframes_publisher_create_external` **returns `CUFRAMES_ERR_INVALID_ARG` in v0.4**. EXTERNAL ownership was removed because VMM requires memory allocated with `cuMemCreate`. For the FFmpeg filter use case mentioned above, switch to `LIBRARY` plus a single `cudaMemcpyAsync` (device-to-device) into the acquired slot. cuframes-rtsp-source has worked this way since v0.4.
+
+`cuframes_publisher_destroy` sends `SHUTDOWN` to all connected subscribers and unlinks the socket and shm. NULL-safe.
+
+### Publish (LIBRARY mode)
+
+```c
+int cuframes_publisher_acquire(cuframes_publisher_t *pub,
+                               void **cuda_ptr_out);
+
+int cuframes_publisher_publish(cuframes_publisher_t *pub,
+                               void *stream,    /* cudaStream_t */
+                               int64_t pts_ns);
+```
+
+`acquire` returns a CUDA device pointer to the next writable slot in the ring. The pointer stays stable while you hold the ring slot, normally until the next `publish`.
+
+Errors:
+
+| Code | When |
+|---|---|
+| `TIMEOUT` | All slots are busy in `STRICT_WAIT` mode |
+| `INVALID_ARG` | `pub` is NULL, or the publisher was created in EXTERNAL mode |
+
+`publish` finalizes the acquired slot. Internally, `cuStreamSynchronize(stream)` guarantees that the producer's writes are hardware-coherent, followed by an atomic update of `slot.seq` and `global_seq`. See [Synchronization](/docs/concepts/sync-vmm-stream) for why stream sync is used rather than CUDA events.
+
+| Param | Meaning |
+|---|---|
+| `stream` | CUDA stream on which the data was written. `0` for the default stream. |
+| `pts_ns` | Timestamp; [`cuframes_now_ns()`](#utils) is recommended. |
+
+### Publish (EXTERNAL mode)
+
+```c
+int cuframes_publisher_publish_external(cuframes_publisher_t *pub,
+                                        void *cuda_ptr,
+                                        void *stream,
+                                        int64_t pts_ns);
+```
+
+**Deprecated in v0.4**; see the note on `create_external` above. Always returns `CUFRAMES_ERR_INVALID_ARG`.
+
+## Subscriber API (sync)
+
+### Config struct
+
+```c
+typedef struct cuframes_subscriber_config {
+    const char *key;
+    const char *consumer_name;        /* NULL = auto */
+    cuframes_subscriber_mode_t mode;
+    int32_t cuda_device;
+    int32_t connect_timeout_ms;       /* 0=fail, -1=wait forever */
+    uint64_t _reserved[4];
+} cuframes_subscriber_config_t;
+```
+
+| Field | Constraints |
+|---|---|
+| `key` | Must match the publisher's |
+| `consumer_name` | If NULL, the library generates `subscriber-<pid>-<random>`. Must be unique per publisher, otherwise `ALREADY_EXISTS`. At most 32 subscribers. |
+| `mode` | See [Subscriber mode](#subscriber-mode) |
+| `cuda_device` | Must match the publisher's: the VMM FD is imported onto the same device |
+| `connect_timeout_ms` | `0` = fail immediately with `NOT_FOUND`; `-1` = wait forever |
+
+### Create / destroy
+
+```c
+int cuframes_subscriber_create(const cuframes_subscriber_config_t *cfg,
+                               cuframes_subscriber_t **out);
+
+int cuframes_subscriber_destroy(cuframes_subscriber_t *sub);
+```
+
+`create` performs the handshake (`HELLO` → `SUBSCRIBE` → `VMM_FDS`) and imports N file descriptors via `cuMemImportFromShareableHandle`. See [Protocol reference §3](/docs/reference/protocol).
+
+Errors:
+
+| Code | When |
+|---|---|
+| `NOT_FOUND` | No publisher with this key was found within `connect_timeout_ms` |
+| `PROTOCOL` | The publisher speaks a different protocol version |
+| `TOO_MANY` | The publisher already has 32 subscribers |
+| `ALREADY_EXISTS` | `consumer_name` is taken |
+| `CUDA` | `cuMemImportFromShareableHandle` failed |
+
+`destroy` performs a graceful close: `UNSUBSCRIBE` message → clean up VMM mappings → close socket. NULL-safe.
+
+### Next frame
+
+```c
+int cuframes_subscriber_next(cuframes_subscriber_t *sub,
+                             void *consumer_stream,
+                             cuframes_frame_t **frame_out,
+                             int32_t timeout_ms);
+
+int cuframes_subscriber_release(cuframes_subscriber_t *sub,
+                                cuframes_frame_t *frame);
+```
+
+`next` blocks for up to `timeout_ms` waiting for a new frame. Semantics per mode:
+
+- `NEWEST_ONLY`: returns the most recent frame and skips intermediate ones;
+- `STRICT_ORDER`: returns the next frame by seq; `DISCONNECTED` on overflow.
+
+`consumer_stream` is the CUDA stream on which you will read the frame. In v0.4 synchronization happens on the publisher side via `cuStreamSynchronize`, so this parameter is reserved for a future event-based fast path and is currently optional (`0` is accepted).
+
+| Param | Meaning |
+|---|---|
+| `consumer_stream` | Consumer's CUDA stream. `0` is accepted. |
+| `frame_out` | Output handle. Free it with `release`. |
+| `timeout_ms` | `<0` = block indefinitely, `0` = non-blocking (returns `WOULD_BLOCK`), `>0` = block with a timeout |
+
+Errors:
+
+| Code | When |
+|---|---|
+| `WOULD_BLOCK` | `timeout_ms=0` and no data is available |
+| `TIMEOUT` | Nothing arrived within `timeout_ms` |
+| `DISCONNECTED` | Publisher shut down, or ring overrun in `STRICT_ORDER` |
+
+`release` ACKs the frame to the publisher (this matters for the `STRICT_WAIT` policy). NULL-safe. After release the frame handle is invalid.
+
+## Subscriber API (async)
+
+```c
+typedef void (*cuframes_frame_callback_t)(const cuframes_frame_t *frame,
+                                          void *user_data);
+typedef void (*cuframes_error_callback_t)(int err, const char *msg,
+                                          void *user_data);
+
+int cuframes_async_subscriber_create(const cuframes_subscriber_config_t *cfg,
+                                     cuframes_frame_callback_t on_frame,
+                                     cuframes_error_callback_t on_error,
+                                     void *user_data,
+                                     cuframes_async_subscriber_t **out);
+
+int cuframes_async_subscriber_destroy(cuframes_async_subscriber_t *sub);
+```
+
+Callback-based wrapper over the sync API. The library starts an internal thread that sits on `next`, calls `on_frame` / `on_error`, and performs `release` itself once the callback returns.
+
+Constraints:
+
+- The frame is **valid only for the duration of the callback**; do not save pointers;
+- The library uses an internal CUDA stream and the pre-wait has already been done; to work on your own stream, use the sync API;
+- `destroy` joins the internal thread and guarantees that no callback runs after it returns (this may take as long as the currently running callback).
+
+## Packet ring API
+
+See [Frame vs Packet ring](/docs/concepts/frame-vs-packet-ring) for when to use the packet ring.
+
+### Flags
+
+```c
+#define CUFRAMES_PKT_FLAG_KEY            0x01u
+#define CUFRAMES_PKT_FLAG_CORRUPT        0x02u
+#define CUFRAMES_PKT_FLAG_DISCONTINUITY  0x04u
+#define CUFRAMES_PKT_FLAG_LAST_IN_AU     0x08u
+```
+
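+Only `KEY` and `CORRUPT` share their values with FFmpeg's `AV_PKT_FLAG_KEY` / `AV_PKT_FLAG_CORRUPT`. In FFmpeg, bits `0x04` and `0x08` mean `AV_PKT_FLAG_DISCARD` and `AV_PKT_FLAG_TRUSTED`, so `DISCONTINUITY` and `LAST_IN_AU` must not be copied into `AVPacket.flags` verbatim.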
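+
+When forwarding packets to FFmpeg, only flag bits whose values coincide with `AV_PKT_FLAG_*` can be passed through. A minimal sketch, assuming FFmpeg's `AV_PKT_FLAG_KEY` (0x0001) and `AV_PKT_FLAG_CORRUPT` (0x0002): `sketch_flags_to_avpkt` is a hypothetical helper, and the FFmpeg constants are redefined locally under `SKETCH_` names so the snippet compiles without libavcodec.
+
+```c
+#include <stdint.h>
+
+#define CUFRAMES_PKT_FLAG_KEY            0x01u
+#define CUFRAMES_PKT_FLAG_CORRUPT        0x02u
+#define CUFRAMES_PKT_FLAG_DISCONTINUITY  0x04u
+#define CUFRAMES_PKT_FLAG_LAST_IN_AU     0x08u
+
+/* Local copies of the FFmpeg values (libavcodec/packet.h). */
+#define SKETCH_AV_PKT_FLAG_KEY      0x0001
+#define SKETCH_AV_PKT_FLAG_CORRUPT  0x0002
+
+/* Hypothetical helper: keeps KEY and CORRUPT, drops the cuframes-only bits
+ * because 0x04 / 0x08 carry a different meaning in FFmpeg. */
+static int sketch_flags_to_avpkt(uint32_t flags)
+{
+    int av_flags = 0;
+    if (flags & CUFRAMES_PKT_FLAG_KEY)
+        av_flags |= SKETCH_AV_PKT_FLAG_KEY;
+    if (flags & CUFRAMES_PKT_FLAG_CORRUPT)
+        av_flags |= SKETCH_AV_PKT_FLAG_CORRUPT;
+    return av_flags;
+}
+```
+
+A demuxer that needs `DISCONTINUITY` or `LAST_IN_AU` would have to handle them separately rather than through `AVPacket.flags`.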
+
+### Publisher-side
+
+```c
+typedef struct cuframes_packet_ring_options {
+    uint32_t ring_slots;       /* default 64 */
+    uint32_t data_size;        /* default 8 MiB */
+    uint32_t max_packet_size;  /* default 2 MiB */
+    uint32_t codec_id;         /* AV_CODEC_ID_* */
+    uint64_t _reserved[4];
+} cuframes_packet_ring_options_t;
+
+int cuframes_publisher_enable_packets(cuframes_publisher_t *pub,
+                                      const cuframes_packet_ring_options_t *opts);
+
+int cuframes_publisher_set_codec_extradata(cuframes_publisher_t *pub,
+                                           const void *extradata, size_t size);
+
+int cuframes_publisher_publish_packet(cuframes_publisher_t *pub,
+                                      const void *data, size_t size,
+                                      int64_t pts_ns, int64_t dts_ns,
+                                      uint32_t flags);
+```
+
+`enable_packets` creates a separate SHM segment, `/dev/shm/cuframes-<key>-packets`. **It must be called before the first `publish_packet`, and preferably before subscribers start connecting**; otherwise a subscriber sees a publisher without a ring and receives no packets. `opts=NULL` selects default sizing.
+
+`set_codec_extradata` writes the SPS/PPS/VPS bytes into the shared header. Subscribers (the FFmpeg demuxer) copy them into `AVCodecContext.extradata`. Size ≤ 4096 bytes.
+
+`publish_packet` writes a single NAL unit (Annex B). On an IDR you must set `CUFRAMES_PKT_FLAG_KEY`; otherwise a late subscriber cannot resync.
+
+Errors:
+
+| Code | When |
+|---|---|
+| `NO_PACKET_RING` | `enable_packets` was not called |
+| `PACKET_OVERSIZED` | `size > max_packet_size` |
+| `ALREADY_EXISTS` | (`enable_packets`) the ring is already enabled |
+
+### Subscriber-side
+
+```c
+typedef struct cuframes_packet cuframes_packet_t;
+
+const void *cuframes_packet_data(const cuframes_packet_t *p);
+size_t      cuframes_packet_size(const cuframes_packet_t *p);
+int64_t     cuframes_packet_pts(const cuframes_packet_t *p);
+int64_t     cuframes_packet_dts(const cuframes_packet_t *p);
+uint32_t    cuframes_packet_flags(const cuframes_packet_t *p);
+uint64_t    cuframes_packet_seq(const cuframes_packet_t *p);
+
+int cuframes_subscriber_enable_packets(cuframes_subscriber_t *sub);
+
+int cuframes_subscriber_next_packet(cuframes_subscriber_t *sub,
+                                    cuframes_packet_t **pkt_out,
+                                    int32_t timeout_ms);
+
+int cuframes_subscriber_release_packet(cuframes_subscriber_t *sub,
+                                       cuframes_packet_t *pkt);
+
+int cuframes_subscriber_get_codec_params(cuframes_subscriber_t *sub,
+                                         uint32_t *codec_id_out,
+                                         const void **extradata_out,
+                                         size_t *extradata_size_out);
+```
+
+`enable_packets` opens the second SHM segment (if the publisher created it). A subscriber can use the frame ring and the packet ring at the same time, or only one of them.
+
+`next_packet`: on its first call, a late subscriber starts from the publisher's `last_keyframe_seq`, so the decoder receives a valid stream without glitches. See [Protocol §10.14](/docs/reference/protocol).
+
+Errors:
+
+| Code | When |
+|---|---|
+| `WOULD_BLOCK` | `timeout_ms=0` and no data is available |
+| `TIMEOUT` | Nothing arrived within `timeout_ms` |
+| `PACKET_OVERRUN` | The subscriber fell behind; the library automatically resyncs to a keyframe on the next call |
+| `DISCONNECTED` | Publisher shut down |
+| `NOT_FOUND` | (`enable_packets`) the publisher has no packet ring |
+
+`get_codec_params` returns a pointer into a library-local buffer that remains valid while the subscriber is alive. If you need the data beyond the subscriber's lifetime, copy it yourself. Returns `NO_CODEC_PARAMS` if the publisher has not yet called `set_codec_extradata`.
+
+`release_packet` is NULL-safe. After release, pointers obtained from `cuframes_packet_*` are invalid.
+
+## Utils
+
+### Frame size calculation
+
+```c
+int cuframes_calc_frame_size(cuframes_format_t format,
+                             int32_t width, int32_t height,
+                             size_t *size_out,
+                             int32_t *pitch_y_out,
+                             int32_t *pitch_uv_out);
+```
+
+Accounts for 256-byte pitch alignment (a CUDA recommendation). `pitch_y_out` / `pitch_uv_out` are optional (may be NULL). Returns `INVALID_ARG` for an unknown format.
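+
+To make the effect of 256-byte pitch alignment concrete, here is an illustration for NV12 only. The exact layout rules inside libcuframes are not spelled out on this page, so the sketch rests on assumptions: 8-bit samples, a UV plane with the same pitch as Y, and half as many UV rows (rounded up). `sketch_align_up` and `sketch_nv12_size` are hypothetical names; real code should call `cuframes_calc_frame_size`.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+#define SKETCH_PITCH_ALIGN 256  /* alignment stated above */
+
+/* Round v up to the next multiple of align (align must be a power of two). */
+static int32_t sketch_align_up(int32_t v, int32_t align)
+{
+    return (v + align - 1) & ~(align - 1);
+}
+
+/* Assumed NV12 layout: full-height Y plane followed by an interleaved UV
+ * plane with half the rows, both using the same aligned pitch.
+ * Returns 0 for non-positive dimensions. */
+static size_t sketch_nv12_size(int32_t width, int32_t height,
+                               int32_t *pitch_y_out, int32_t *pitch_uv_out)
+{
+    if (width <= 0 || height <= 0)
+        return 0;
+    int32_t pitch = sketch_align_up(width, SKETCH_PITCH_ALIGN);
+    int32_t uv_rows = (height + 1) / 2;
+    if (pitch_y_out)
+        *pitch_y_out = pitch;
+    if (pitch_uv_out)
+        *pitch_uv_out = pitch;
+    return (size_t)pitch * (size_t)height + (size_t)pitch * (size_t)uv_rows;
+}
+```
+
+Under these assumptions a 1920×1080 frame gets a pitch of 2048 bytes and occupies 3,317,760 bytes, which is why consumers must index rows by pitch rather than by width.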
+
+### Monotonic time
+
+```c
+int64_t cuframes_now_ns(void);
+```
+
+`CLOCK_MONOTONIC` in nanoseconds. Use it as `pts_ns` in real-time pipelines:
+
+```c
+cuframes_publisher_publish(pub, stream, cuframes_now_ns());
+```
+
+## See also
+
+- [C++ API](/docs/reference/api-cpp): RAII wrapper.
+- [Protocol reference](/docs/reference/protocol): wire format, handshake, ABI layouts.
+- [Frame vs Packet ring](/docs/concepts/frame-vs-packet-ring): when to use which.
+- [Synchronization](/docs/concepts/sync-vmm-stream): why `cuStreamSynchronize` rather than CUDA events.
+- [First publisher](/docs/getting-started/first-publisher): a working C example.
@@ -0,0 +1,351 @@
+---
+sidebar_position: 2
+title: C++ API
+---
+
+# C++ API reference
+
+`<cuframes/cuframes.hpp>` is a header-only RAII wrapper over the [C API](/docs/reference/api-c). It is a thin layer: handle classes with automatic cleanup, exceptions instead of int return codes, and `std::optional<FrameRef>` for `next`.
+
+## Headers & linkage
+
+```cpp
+#include <cuframes/cuframes.hpp>
+```
+
+```bash
+# C++17 minimum (std::optional is required)
+c++ -std=c++17 app.cpp -lcuframes
+# when using your own CUDA streams
+c++ -std=c++17 app.cpp -lcuframes -lcudart
+```
+
+Header-only: there is no separate C++ library, only a wrapper over `libcuframes.so`. ABI compatibility is inherited from the C API.
+
+Everything lives in namespace `cuframes`:
+
+```cpp
+namespace cuframes {
+    class Error;
+    class Frame;
+    class FrameRef;
+    class Publisher;
+    class Subscriber;
+    class AsyncSubscriber;
+    struct PublisherOptions;
+    struct SubscriberOptions;
+    inline int64_t now_ns();
+    inline size_t calc_frame_size(...);
+}
+```
+
+## Exceptions
+
+```cpp
+class cuframes::Error : public std::runtime_error {
+public:
+    Error(int code, const std::string &context);
+    int code() const noexcept;
+};
+```
+
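+Thrown from every method except those explicitly marked `noexcept`. `code()` returns the original [`cuframes_error_t`](/docs/reference/api-c#error-codes); `what()` returns `"<context>: <strerror>"`. The example below uses designated initializers, which are standard only since C++20; GCC and Clang accept them in C++17 mode as an extension.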
+
+```cpp
+try {
+    cuframes::Publisher pub({.key = "cam1", .width = 1920, .height = 1080});
+} catch (const cuframes::Error &e) {
+    if (e.code() == CUFRAMES_ERR_ALREADY_EXISTS) {
+        // another live publisher already holds this key
+    }
+}
+```
+
+## Frame & FrameRef
+
+`Frame` is a read-only, non-owning view over `cuframes_frame_t`. It is used in async subscriber callbacks.
+
+`FrameRef` is an owning RAII handle: `release` is called automatically in the destructor. Movable, not copyable.
+
+```cpp
+class Frame {
+public:
+    void *cuda_ptr() const noexcept;
+    cuframes_format_t format() const noexcept;
+    int32_t width() const noexcept;
+    int32_t height() const noexcept;
+    int32_t pitch_y() const noexcept;
+    int32_t pitch_uv() const noexcept;
+    uint64_t seq() const noexcept;
+    int64_t pts_ns() const noexcept;
+    const cuframes_frame_t *raw() const noexcept;
+};
+
+class FrameRef {
+public:
+    explicit operator bool() const noexcept;
+    Frame view() const noexcept;
+    // Shortcut accessors: cuda_ptr, width, height, pitch_y, pitch_uv, seq, pts_ns
+};
+```
+
+All accessor methods are `noexcept`: they forward to C accessors, which allocate nothing.
+
+## Publisher
+
+```cpp
+struct PublisherOptions {
+    std::string key;
+    int32_t width = 0;
+    int32_t height = 0;
+    cuframes_format_t format = CUFRAMES_FORMAT_NV12;
+    int32_t ring_size = 4;
+    cuframes_publisher_policy_t policy = CUFRAMES_POLICY_DROP_OLDEST;
+    int32_t consumer_ack_timeout_ms = 0;
+    int32_t cuda_device = 0;
+};
+
+class Publisher {
+public:
+    explicit Publisher(const PublisherOptions &opt);
+
+    // EXTERNAL ownership: DEPRECATED in v0.4, throws Error(INVALID_ARG)
+    Publisher(const PublisherOptions &opt,
+              void *const *cuda_ptrs, int32_t ptr_count, size_t frame_size);
+
+    ~Publisher();
+    Publisher(Publisher &&) noexcept;
+    Publisher &operator=(Publisher &&) noexcept;
+
+    void *acquire();
+    void publish(void *stream, int64_t pts_ns);
+
+    // EXTERNAL mode: DEPRECATED in v0.4
+    void publish_external(void *cuda_ptr, void *stream, int64_t pts_ns);
+
+    // Packet ring
+    void enable_packets(const cuframes_packet_ring_options_t *opts = nullptr);
+    void set_codec_extradata(const void *data, size_t size);
+    int publish_packet(const void *data, size_t size,
+                       int64_t pts_ns, int64_t dts_ns, uint32_t flags) noexcept;
+
+    cuframes_publisher_t *raw() noexcept;
+};
+```
+
+**Note on the deprecated EXTERNAL ownership constructor.** In v0.4 the second constructor `Publisher(opt, cuda_ptrs, ...)` calls `cuframes_publisher_create_external` under the hood, immediately gets `INVALID_ARG`, and throws `cuframes::Error`. For FFmpeg filter or custom decoder integration, switch to LIBRARY mode plus a single `cudaMemcpyAsync` (device-to-device) into the pointer returned by `acquire()`. See the [C API note](/docs/reference/api-c#create--destroy).
+
+`publish_packet` is the only operation that reports failure through an int return code instead of throwing. This avoids paying for exception unwinding on every packet in the encoder's hot loop.
+
+Minimal publisher loop:
+
+```cpp
+cuframes::Publisher pub({
+    .key = "cam1",
+    .width = 1920, .height = 1080,
+    .format = CUFRAMES_FORMAT_NV12,
+});
+
+cudaStream_t stream;
+cudaStreamCreate(&stream);
+
+for (;;) {
+    void *slot = pub.acquire();
+    // ... NVDEC / a kernel writes into slot on `stream` ...
+    pub.publish(stream, cuframes::now_ns());
+}
+```
+
+## Subscriber (sync)
+
+```cpp
+struct SubscriberOptions {
+    std::string key;
+    std::string consumer_name;        // empty = auto-generate
+    cuframes_subscriber_mode_t mode = CUFRAMES_MODE_NEWEST_ONLY;
+    int32_t cuda_device = 0;
+    int32_t connect_timeout_ms = 5000;
+};
+
+class Subscriber {
+public:
+    explicit Subscriber(const SubscriberOptions &opt);
+    ~Subscriber();
+    Subscriber(Subscriber &&) noexcept;
+    Subscriber &operator=(Subscriber &&) noexcept;
+
+    std::optional<FrameRef> next(void *stream, int32_t timeout_ms = -1);
+
+    cuframes_subscriber_t *raw() noexcept;
+};
+```
+
+`next` returns `std::nullopt` for recoverable conditions (`TIMEOUT`, `WOULD_BLOCK`, `DISCONNECTED`) and throws `Error` for everything else. This asymmetry is deliberate: the three listed cases are expected in a normal loop and should not pay for exception unwinding.
+
+```cpp
+cuframes::Subscriber sub({
+    .key = "cam1",
+    .mode = CUFRAMES_MODE_NEWEST_ONLY,
+});
+
+cudaStream_t stream;
+cudaStreamCreate(&stream);
+
+while (auto frame = sub.next(stream, 1000)) {
+    // frame->cuda_ptr(), frame->width(), frame->pts_ns()
+    // release happens automatically when frame goes out of scope
+}
+// the loop exits on the first std::nullopt (timeout or disconnect)
+```
+
+## Subscriber (async)
+
+```cpp
+class AsyncSubscriber {
+public:
+    using OnFrame = std::function<void(const Frame &)>;
+    using OnError = std::function<void(int err, const std::string &msg)>;
+
+    AsyncSubscriber(const SubscriberOptions &opt,
+                    OnFrame on_frame,
+                    OnError on_error = {});
+    ~AsyncSubscriber();
+};
+```
+
+Callback-based wrapper. The library starts an internal thread; a `Frame` is valid **only** for the duration of the callback (it is released automatically after the callback returns).
+
+```cpp
+cuframes::AsyncSubscriber sub(
+    {.key = "cam1"},
+    [](const cuframes::Frame &f) {
+        // PRIu64 / PRId64 come from <cinttypes>
+        std::printf("seq=%" PRIu64 " pts=%" PRId64 "\n", f.seq(), f.pts_ns());
+    },
+    [](int err, const std::string &msg) {
+        std::fprintf(stderr, "cuframes error %d: %s\n", err, msg.c_str());
+    });
+
+// keep the subscriber alive...
+std::this_thread::sleep_for(std::chrono::seconds(60));
+// the destructor joins the worker thread
+```
+
+Not copyable. Moving is also disallowed in the current version (the `std::function` fields hold the `this` pointer for the trampoline, so a move would break it).
+
+## Packet ring
+
+In the C++ wrapper the packet ring is available through the plain C functions and `Publisher::publish_packet`. There are no separate `cuframes::Packet` / `cuframes::PacketRef` classes: the packet API is simpler, and FFmpeg interop is often written directly against C.
+
+```cpp
+// Publisher-side
+cuframes_packet_ring_options_t pkt_opts{};
+pkt_opts.ring_slots = 64;
+pkt_opts.data_size = 8 * 1024 * 1024;
+pkt_opts.max_packet_size = 2 * 1024 * 1024;
+pkt_opts.codec_id = AV_CODEC_ID_H264;
+pub.enable_packets(&pkt_opts);
+pub.set_codec_extradata(sps_pps.data(), sps_pps.size());
+
+int rc = pub.publish_packet(nal, nal_size, pts, dts, CUFRAMES_PKT_FLAG_KEY);
+if (rc < 0 && rc != CUFRAMES_ERR_PACKET_OVERSIZED) {
+    // log + skip; OVERSIZED is safe to ignore
+}
+```
+
+Subscriber side: plain C functions from `<cuframes/cuframes.h>`; see [Packet subscriber API](/docs/reference/api-c#subscriber-side).
+
+## Utilities
+
+```cpp
+inline int64_t cuframes::now_ns();
+
+inline size_t cuframes::calc_frame_size(cuframes_format_t format,
+                                        int32_t w, int32_t h,
+                                        int32_t *pitch_y = nullptr,
+                                        int32_t *pitch_uv = nullptr);
+```
+
+`calc_frame_size` throws `Error` on an unknown format (unlike the C variant, which returns an error code).
+
+## Examples
+
+### Complete publisher (LIBRARY mode)
+
+```cpp
+#include <cuframes/cuframes.hpp>
+#include <cuda_runtime.h>
+
+int main() {
+    cuframes::Publisher pub({
+        .key = "cam1",
+        .width = 1920, .height = 1080,
+        .format = CUFRAMES_FORMAT_NV12,
+        .ring_size = 4,
+    });
+
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+
+    for (int i = 0; i < 1000; i++) {
+        void *slot = pub.acquire();
+        // ... NVDEC decode or kernel write into slot ...
+        pub.publish(stream, cuframes::now_ns());
+    }
+}
+```
+
+### Complete subscriber
+
+```cpp
+#include <cuframes/cuframes.hpp>
+#include <cuda_runtime.h>
+
+int main() {
+    cuframes::Subscriber sub({
+        .key = "cam1",
+        .consumer_name = "my-detector",
+        .mode = CUFRAMES_MODE_NEWEST_ONLY,
+        .connect_timeout_ms = 5000,
+    });
+
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+
+    for (;;) {
+        auto frame = sub.next(stream, 1000);
+        if (!frame) continue;  // timeout/disconnect
+
+        // frame->cuda_ptr(): VRAM pointer
+        // frame->width(), frame->height(): pixels
+        // frame->pitch_y(), frame->pitch_uv(): bytes per row
+        // ... ML inference / CUDA filter on stream ...
+    }
+}
+```
+
+### Async with lambdas
+
+```cpp
+#include <cuframes/cuframes.hpp>
+#include <atomic>
+#include <chrono>
+#include <cinttypes>
+#include <cstdio>
+#include <thread>
+
+int main() {
+    std::atomic<uint64_t> frames{0};
+
+    cuframes::AsyncSubscriber sub(
+        {.key = "cam1"},
+        [&](const cuframes::Frame &f) {
+            frames.fetch_add(1);
+            // pre-sync is already done library-side; f.cuda_ptr() is ready
+        });
+
+    std::this_thread::sleep_for(std::chrono::seconds(10));
+    std::printf("processed %" PRIu64 " frames\n", frames.load());
+}
+```
+
+## See also
+
+- [C API](/docs/reference/api-c): underlying C functions.
+- [Protocol reference](/docs/reference/protocol): wire format spec.
+- [First publisher](/docs/getting-started/first-publisher): minimal end-to-end example.
@@ -0,0 +1,608 @@
+---
+sidebar_position: 3
+title: Wire protocol v4 spec
+---
+
+# cuframes wire protocol — version 4
+
+**Status:** v4, production (deployed 2026-05-25). Semver-stable within a major version.
+**Endianness:** little-endian (CUDA-platforms only).
+**libcuframes version:** 0.4.0.
+
+This page describes the **byte-exact** format of all cuframes structures and messages. Any implementation (C, Python ctypes, Rust bindings, FFmpeg plugin) must conform to this document. The reference implementation is `libcuframes/src/` in the repository.
+
+## What changed in v4
+
+v0.4 replaced the GPU memory sharing mechanism: CUDA IPC mem-handles gave way to CUDA VMM + POSIX file descriptors. This is a **breaking change** at the wire level.
+
+| Aspect | v1–v3 (legacy) | v4 (current) |
+|---|---|---|
+| Magic | `0xCC7C1DCC` | `0xCC7C1DCE` |
+| Protocol version | 1–3 | 4 |
+| GPU memory share | `cudaIpcMemHandle_t` in the slot descriptor | `cuMemCreate(POSIX_FILE_DESCRIPTOR)` + `SCM_RIGHTS` |
+| Cross-process sync | `cudaIpcEventHandle_t` + `cudaStreamWaitEvent` | producer's `cuStreamSynchronize` + atomic seq release |
+| PID namespace sharing | required (CUDA IPC limitation) | **not required**: POSIX FDs are passed over `SCM_RIGHTS` |
+| Handshake messages | `HELLO` → `SUBSCRIBE` | `HELLO` → `SUBSCRIBE` → `VMM_FDS` (new) |
+| EXTERNAL ownership | supported | removed (VMM requires `cuMemCreate`-allocated memory) |
+
+The old magic `0xCC7C1DCC` causes a clean failure in v4 (`HELLO_RESP(result=CUFRAMES_ERR_PROTOCOL)`). Mixed-version deployments **do not work**: the publisher and all subscribers must be upgraded together.
+
+See [Synchronization & VMM stream](/docs/concepts/sync-vmm-stream) for why stream sync is used rather than CUDA events.
+
+## 1. Resources / Lifecycle
+
+A single publisher creates the following kernel-level resources:
+
+| Resource | Path | Purpose | Cleanup |
+|---|---|---|---|
+| Unix socket | `/run/cuframes/<key>.sock` | Handshake + control plane | unlinked on `destroy()`; if orphaned after a crash, cleaned up on the next `create` via an `O_EXCL` retry |
+| Frame SHM | `/dev/shm/cuframes-<key>` | Frame ring header + slot descriptors | `shm_unlink` on `destroy()`; orphaned automatically if nobody has it mmapped |
+| VMM-allocated VRAM | (no path; subscribers receive the FD via `SCM_RIGHTS`) | Frame data in HBM | freed once the last subscriber has called `cuMemUnmap` and the publisher has called `cuMemRelease` |
+| Packet SHM | `/dev/shm/cuframes-<key>-packets` | Packet ring header + slots + data section | `shm_unlink` on `destroy()`; opt-in (only if `enable_packets` was called) |
+
+`<key>` is ASCII, `[a-zA-Z0-9_-]`, 1–63 bytes. The library validates it against the regex `^[a-zA-Z0-9_-]{1,63}$`.
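+
+Other implementations of the protocol need the same key check before they build socket and SHM paths. The sketch below applies the rule `^[a-zA-Z0-9_-]{1,63}$` by hand, without a regex engine; `sketch_key_is_valid` is a hypothetical name and not the library's internal validator.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+
+/* Hypothetical helper: true if key matches ^[a-zA-Z0-9_-]{1,63}$. */
+static bool sketch_key_is_valid(const char *key)
+{
+    if (key == NULL)
+        return false;
+    size_t len = 0;
+    for (const char *p = key; *p != '\0'; ++p, ++len) {
+        char c = *p;
+        bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
+                  (c >= '0' && c <= '9') || c == '_' || c == '-';
+        /* len counts the characters already accepted, so len >= 63
+         * here means this would be the 64th character. */
+        if (!ok || len >= 63)
+            return false;
+    }
+    return len >= 1;
+}
+```
+
+Rejecting `/` and `.` matters in practice because the key is embedded in `/run/cuframes/<key>.sock` and `/dev/shm/cuframes-<key>`.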
+
+### Normal shutdown
+
+1. `cuframes_publisher_destroy()` is called.
+2. The publisher sends `SHUTDOWN` to all connected subscribers over the socket.
+3. Subscribers tear down their VMM mappings (`cuMemUnmap`, `cuMemAddressFree`, `close(fd)`), close the socket, munmap, and return `DISCONNECTED`.
+4. Publisher: `cuMemUnmap` its own pool + `cuMemRelease`, close the socket, unlink the socket and shm.
+
+### Abnormal shutdown (publisher crash)
+
+The producer has no chance to unlink, so a stale socket and shm are left behind. Recovery on the next `cuframes_publisher_create()`:
+
+1. Attempt `connect(sock_path)` → `ECONNREFUSED` → stale, unlink.
+2. Open `/dev/shm/cuframes-<key>`: if the magic matches, check liveness via `pidfd_open(producer_pid)` or `kill(pid, 0)`.
+3. If the producer is dead → `shm_unlink` and continue with create.
+4. If the producer is alive → return `ALREADY_EXISTS`.
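+
+The `kill(pid, 0)` variant of the liveness probe in step 2 can be sketched as follows. This is an illustration only: `sketch_pid_is_alive` is a hypothetical helper, and it cannot tell the original producer apart from an unrelated process that later reused the same PID.
+
+```c
+#define _POSIX_C_SOURCE 200809L
+#include <errno.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <sys/types.h>
+
+/* Signal 0 runs the existence and permission checks without delivering
+ * a signal. EPERM means the process exists but belongs to another user. */
+static bool sketch_pid_is_alive(pid_t pid)
+{
+    if (pid <= 0)
+        return false;  /* 0 and negative values address process groups */
+    if (kill(pid, 0) == 0)
+        return true;
+    return errno == EPERM;
+}
+```
+
+A publisher doing recovery would read `producer_pid` from the stale header, call the probe, and pick step 3 or step 4 based on the result.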
+
+Subscribers detect a publisher crash through socket EOF (recv returns 0) or `SIGPIPE` on send. They must then `cuMemUnmap` all slot mappings and return `DISCONNECTED` from `cuframes_subscriber_next`.
+
+**Note on VMM leak risk.** Unlike CUDA IPC handles, POSIX FDs are closed by the kernel automatically when a process exits. Even if a subscriber crashes uncleanly, the kernel closes its FDs, and the driver completes the unmap when the publisher later calls `cuMemRelease`. In v0.4 this makes leaks less of a concern than in legacy v1–v3.
+
+## 2. Frame SHM layout
+
+`/dev/shm/cuframes-<key>` has a fixed size: `sizeof(cuframes_shared_header_t)`. There are no variable-length sections.
+
+### 2.0 Header byte layout
+
+```
+Offset  Size   Field                    Comments
+─────── ────── ──────────────────────── ────────────────────────────────────────
+0x0000      4  magic (LE u32)           0xCC7C1DCE
+0x0004      4  proto_version (LE u32)   4
+0x0008      4  lib_version_major
+0x000C      4  lib_version_minor
+0x0010      4  lib_version_patch
+0x0014      4  reserved_a               0
+0x0018      8  producer_pid (LE u64)    for liveness check
+0x0020      8  ring_size (LE u64)       1..16
+0x0028      8  ownership_mode           0 (LIBRARY only in v4)
+0x0030      8  policy                   0=DROP_OLDEST, 1=STRICT_WAIT
+0x0038      8  max_subscribers          32 (hard-coded)
+0x0040     64  meta                     frame meta packed (see §2.1)
+0x0080     64  reserved_events          0 (was ipc_event_handle in v1–v3)
+0x00C0      8  global_seq (LE u64)      atomic, monotonic
+0x00C8      8  subscriber_bitmap        atomic, bit per subscriber slot
+0x00D0      8  shutdown_flag            atomic, 0=normal, 1=shutting down
+0x00D8     40  reserved_b               0
+0x0100   N×D  slots[ring_size]          slot descriptor, see §2.2
+0x0100+ND M×S subscribers[32]           subscriber slot, see §2.3
+```
+
+Constants:
+
+- N = `ring_size`, up to 16
+- D = `sizeof(slot_descriptor)` = **192 bytes** (§2.2)
+- M = 32 (max subscribers)
+- S = `sizeof(subscriber_slot)` = 128 bytes (§2.3)
+- Max SHM size: `0x0100 + 16×192 + 32×128` = 7424 bytes (7.25 KiB)
+
+All fields marked atomic are accessed through C11 `_Atomic` (or the `__atomic_*` builtins).
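+
+The offsets above can be pinned at compile time. The sketch below is an illustration only: `shm_header_prefix_t` and `shm_layout_size` are hypothetical names that mirror the table, not the library's real `cuframes_shared_header_t`, and the atomic fields are declared as plain integers on the assumption that they are accessed through `__atomic_*` builtins.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Hypothetical mirror of the fixed 0x100-byte prefix of the v4 header. */
+typedef struct {
+    uint32_t magic;               /* 0x0000 */
+    uint32_t proto_version;       /* 0x0004 */
+    uint32_t lib_version[3];      /* 0x0008 major, minor, patch */
+    uint32_t reserved_a;          /* 0x0014 */
+    uint64_t producer_pid;        /* 0x0018 */
+    uint64_t ring_size;           /* 0x0020 */
+    uint64_t ownership_mode;      /* 0x0028 */
+    uint64_t policy;              /* 0x0030 */
+    uint64_t max_subscribers;     /* 0x0038 */
+    uint8_t  meta[64];            /* 0x0040 */
+    uint8_t  reserved_events[64]; /* 0x0080 */
+    uint64_t global_seq;          /* 0x00C0, atomic */
+    uint64_t subscriber_bitmap;   /* 0x00C8, atomic */
+    uint64_t shutdown_flag;       /* 0x00D0, atomic */
+    uint8_t  reserved_b[40];      /* 0x00D8 */
+} shm_header_prefix_t;
+
+_Static_assert(offsetof(shm_header_prefix_t, producer_pid) == 0x18, "producer_pid");
+_Static_assert(offsetof(shm_header_prefix_t, meta) == 0x40, "meta");
+_Static_assert(offsetof(shm_header_prefix_t, global_seq) == 0xC0, "global_seq");
+_Static_assert(sizeof(shm_header_prefix_t) == 0x100, "prefix size");
+
+enum { SLOT_DESC_SIZE = 192, SUB_SLOT_SIZE = 128, MAX_SUBSCRIBERS = 32 };
+
+/* Bytes occupied by the prefix, slots[ring_size] and subscribers[32]. */
+static size_t shm_layout_size(uint64_t ring_size)
+{
+    return sizeof(shm_header_prefix_t)
+         + (size_t)ring_size * SLOT_DESC_SIZE
+         + (size_t)MAX_SUBSCRIBERS * SUB_SLOT_SIZE;
+}
+```
+
+With `ring_size = 16` the function returns 7424, which matches the maximum size listed above.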
+
+### 2.1 Frame meta (64 bytes)
+
+```
+Offset  Size  Field                Comments
+0x00       4  format (LE u32)      cuframes_format_t enum
+0x04       4  width
+0x08       4  height
+0x0C       4  pitch_y              bytes per row of Y / single plane
+0x10       4  pitch_uv             bytes per row of UV (0 if no chroma plane)
+0x14       4  bits_per_pixel       informational
+0x18       8  frame_size_bytes     full size (Y+UV or packed)
+0x20      32  reserved             0
+```
+
+### 2.2 Slot descriptor (192 bytes)
+
+```
+Offset  Size   Field                       Comments
+0x00       8   seq (LE u64, atomic)        published seq; UINT64_MAX = invalid
+0x08       8   pts_ns (LE i64, atomic)
+0x10       8   ack_bitmap (LE u64, atomic) bit i = subscriber #i has ACKed
+0x18       8   written_bytes               for diagnostics (may be 0)
+0x20      64   reserved_mem_handle         0 in v4 (was cudaIpcMemHandle_t in v1–v3)
+0x60      32   reserved_external           0 in v4 (was cuda_ptr_external)
+0x80      16   reserved_a
+0x90      48   reserved_b
+0xC0    END
+```
+
+In v4 the `mem_handle` field (offset 0x20, 64 bytes, listed above as `reserved_mem_handle`) is **no longer used**: instead of an IPC handle, subscribers receive a POSIX FD through `SCM_RIGHTS` during the handshake (see §3). The field is reserved for possible reuse in future versions. A v4 reader must ignore its contents.
+
+The slot's memory layout is static (it is created in `publisher_create`), but `seq` / `pts_ns` / `ack_bitmap` / `written_bytes` are updated atomically on every publish.
+
+### 2.3 Subscriber slot (128 bytes)
+
+```
+Offset  Size   Field                      Comments
+0x00       8   state (LE u64, atomic)     0=free, 1=connecting, 2=active, 3=draining
+0x08       8   consumer_pid (LE u64)      liveness check
+0x10       8   last_seen_seq (LE u64)     monotonic, the last ACKed seq
+0x18       8   last_ack_ns (LE i64)       wall-clock time of the last ACK
+0x20      32   consumer_name              ASCII zero-terminated, max 31 char
+0x40      64   reserved                   0
+```
+
+Bit position **0** is reserved as a sentinel. Bits 1..31 are used, so at most 31 subscribers can attach (the 32nd attempt fails with error code `TOO_MANY`).
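+
+Bit assignment under this rule can be sketched as below. This is an illustration under stated assumptions: `alloc_subscriber_bit` is a hypothetical helper, and it works on a snapshot of `subscriber_bitmap`, whereas a real publisher would have to claim the bit with an atomic compare-and-swap.
+
+```c
+#include <stdint.h>
+
+/* Hypothetical helper: lowest free subscriber bit in 1..31, or -1 if all are taken. */
+static int alloc_subscriber_bit(uint64_t subscriber_bitmap)
+{
+    for (int bit = 1; bit <= 31; ++bit) {
+        if ((subscriber_bitmap & (1ULL << bit)) == 0)
+            return bit;
+    }
+    return -1; /* the caller reports TOO_MANY */
+}
+```
+
+Bit 0 is never returned, which keeps the sentinel free regardless of what the bitmap contains.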
+
+## 3. Unix socket protocol
+
+The publisher `listen()`s on a `SOCK_STREAM` socket at `/run/cuframes/<key>.sock`, and the subscriber `connect()`s to it.
+
+The socket is used for:
+
+- Handshake (`HELLO`, `SUBSCRIBE`)
+- **VMM file descriptor delivery** (`VMM_FDS`, new in v4) through `sendmsg(SCM_RIGHTS)`
+- Lifetime signals (`SHUTDOWN`, force-disconnect, `UNSUBSCRIBE`)
+
+The socket is **not** used for frame transfer; frames travel through VMM-mapped shared memory plus the atomic `global_seq`.
+
+### 3.1 Framing
+
+Every message is a TLV (type-length-value):
+
+```
+[4 bytes]  message_type (LE u32, see §3.2)
+[4 bytes]  payload_length (LE u32, bytes excl. these 8 header bytes)
+[N bytes]  payload (length = payload_length)
+```
+
+A `VMM_FDS` message is additionally accompanied by ancillary data sent with `sendmsg` (see §3.7).
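+
+The 8-byte TLV header can be encoded and decoded as in the sketch below. The function names are hypothetical (the reference helpers live in `protocol.c` and may look different); the byte-wise encoding keeps the code independent of host endianness.
+
+```c
+#include <stdint.h>
+
+/* Store a 32-bit value in little-endian byte order. */
+static void put_le32(uint8_t *p, uint32_t v)
+{
+    p[0] = (uint8_t)v;
+    p[1] = (uint8_t)(v >> 8);
+    p[2] = (uint8_t)(v >> 16);
+    p[3] = (uint8_t)(v >> 24);
+}
+
+/* Load a 32-bit little-endian value. */
+static uint32_t get_le32(const uint8_t *p)
+{
+    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
+         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
+}
+
+/* Write message_type and payload_length into an 8-byte header. */
+static void tlv_write_header(uint8_t out[8], uint32_t type, uint32_t payload_len)
+{
+    put_le32(out, type);
+    put_le32(out + 4, payload_len);
+}
+
+/* Read message_type and payload_length back from an 8-byte header. */
+static void tlv_read_header(const uint8_t in[8], uint32_t *type, uint32_t *payload_len)
+{
+    *type = get_le32(in);
+    *payload_len = get_le32(in + 4);
+}
+```
+
+A receiver would read exactly 8 bytes, decode them, and then read `payload_length` further bytes before dispatching on `message_type`.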
+
+### 3.2 Message types
+
+| ID | Name | Direction | When |
+|---|---|---|---|
+| `0x01` | `HELLO_REQ` | C→P | First message from the consumer |
+| `0x02` | `HELLO_RESP` | P→C | Publisher's reply |
+| `0x03` | `SUBSCRIBE_REQ` | C→P | Completes the handshake and requests a subscriber slot |
+| `0x04` | `SUBSCRIBE_RESP` | P→C | Confirmation plus the assigned bit |
+| `0x05` | `VMM_FDS` | P→C | **(new in v4)** Delivers N file descriptors through `SCM_RIGHTS` |
+| `0x10` | `UNSUBSCRIBE` | C→P | Graceful disconnect |
+| `0x30` | `SHUTDOWN` | P→C | Publisher shutting down |
+| `0xF0` | `PING` | both | Liveness check |
+| `0xF1` | `PONG` | both | Reply to PING |
+| `0xFE` | `ERROR` | both | Error notification (см. §3.9) |
+
+`0x20` (`EVENT_FD` in legacy v1–v3) is **deprecated in v4**. Wakeup currently relies on atomic polling of `global_seq`; reintroducing an FD wakeup channel is a possible v0.5 feature.
+
+### 3.3 HELLO_REQ payload (consumer → publisher)
+
+```
+[4 bytes]  proto_version (LE u32)        consumer's wire version (must be 4)
+[4 bytes]  consumer_name_len (LE u32)
+[N bytes]  consumer_name (UTF-8, no null terminator)
+[4 bytes]  cuda_device (LE i32)
+[4 bytes]  mode (LE u32, cuframes_subscriber_mode_t)
+[12 bytes] reserved (must be 0)
+```
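+
+Because `consumer_name` is variable-length, the fields after it have no fixed offset. The sketch below serializes the payload to make that concrete; `build_hello_req` is a hypothetical helper written for this illustration, not a function from the library.
+
+```c
+#include <stdint.h>
+#include <string.h>
+
+/* Store a 32-bit value in little-endian byte order. */
+static void put_le32(uint8_t *p, uint32_t v)
+{
+    p[0] = (uint8_t)v;
+    p[1] = (uint8_t)(v >> 8);
+    p[2] = (uint8_t)(v >> 16);
+    p[3] = (uint8_t)(v >> 24);
+}
+
+/* Returns the payload length (28 + name length), or 0 if `cap` is too small. */
+static size_t build_hello_req(uint8_t *buf, size_t cap, const char *name,
+                              int32_t cuda_device, uint32_t mode)
+{
+    size_t n = strlen(name);
+    size_t total = 4 + 4 + n + 4 + 4 + 12;
+    if (total > cap)
+        return 0;
+    put_le32(buf, 4);                             /* proto_version */
+    put_le32(buf + 4, (uint32_t)n);               /* consumer_name_len */
+    memcpy(buf + 8, name, n);                     /* name, no terminator */
+    put_le32(buf + 8 + n, (uint32_t)cuda_device); /* cuda_device */
+    put_le32(buf + 12 + n, mode);                 /* mode */
+    memset(buf + 16 + n, 0, 12);                  /* reserved, must be 0 */
+    return total;
+}
+```
+
+In this layout the `reserved` block starts at payload offset `16 + consumer_name_len`.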
+
+### 3.4 HELLO_RESP payload (publisher → consumer)
+
+```
+[4 bytes]  result (LE i32)               0 = success, negative = cuframes_error_t
+[4 bytes]  proto_version_actual (LE u32) publisher's wire version
+[4 bytes]  ring_size (LE u32)
+[4 bytes]  ownership_mode (LE u32)       0 (LIBRARY)
+[64 bytes] frame_meta                    see §2.1
+[4 bytes]  shm_path_len (LE u32)
+[N bytes]  shm_path (UTF-8, ASCII subset)
+[12 bytes] reserved
+```
+
+When `result < 0` the payload may be shorter (only `result` + `reserved`).
+
+### 3.5 SUBSCRIBE_REQ payload
+
+```
+[4 bytes]  proto_version (repeated as a double-check)
+[12 bytes] reserved
+```
+
+### 3.6 SUBSCRIBE_RESP payload
+
+```
+[4 bytes]  result (LE i32)
+[4 bytes]  assigned_bit (LE u32)         1..31
+[8 bytes]  initial_seq (LE u64)          current global_seq at handshake time
+[12 bytes] reserved
+```
+
+Right after SUBSCRIBE_RESP the publisher sends `VMM_FDS` (§3.7). It is part of the handshake, so the subscriber must wait for it before moving to READY.
+
+### 3.7 VMM_FDS — file descriptor delivery (v4)
+
+The publisher sends `ring_size` file descriptors through `sendmsg()` with ancillary data (`SCM_RIGHTS`).
+
+**TLV payload (without ancillary data):**
+
+```
+[4 bytes]  fd_count (LE u32)             must equal ring_size
+[8 bytes]  frame_size_bytes (LE u64)     size of one VMM mapping
+[8 bytes]  vmm_granularity (LE u64)      rounding granularity from the CUDA driver (typically 2 MiB)
+[12 bytes] reserved (must be 0)
+```
+
+**Ancillary data:** one `cmsg` at level `SOL_SOCKET`, type `SCM_RIGHTS`, carrying `fd_count` file descriptors (`fd_count × sizeof(int)` bytes). The kernel duplicates the FDs into the consumer process automatically.
+
+The consumer must:
+
+1. Call `recvmsg` with a cmsg buffer that is large enough (`CMSG_SPACE(sizeof(int) * 16)` is recommended);
+2. For each FD, call `cuMemImportFromShareableHandle(&handle, (void *)(uintptr_t)fd, CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR)`; the function takes three arguments, and the FD is passed as the `void *` OS handle;
+3. `cuMemAddressReserve` → `cuMemMap` → `cuMemSetAccess` (READ_WRITE on the consumer device);
+4. `close(fd)` (the CUDA driver keeps its own reference after the import).
+
+After all FDs are imported successfully, the subscriber moves to the READY state.
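+
+The `SCM_RIGHTS` part of step 1 can be sketched without CUDA. This is an illustration under stated assumptions: `send_fds` and `recv_fds` are hypothetical helpers (not the `send_msg_with_fds` / `recv_msg_with_fds` from `protocol.c`), and a 1-byte dummy payload stands in for the real TLV message.
+
+```c
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <sys/uio.h>
+#include <unistd.h>
+
+#define MAX_RING_FDS 16
+
+/* Control buffer sized for the largest ring, aligned for struct cmsghdr. */
+typedef union {
+    char buf[CMSG_SPACE(sizeof(int) * MAX_RING_FDS)];
+    struct cmsghdr align;
+} fd_ctrl_t;
+
+/* Send `count` descriptors in one SCM_RIGHTS message. Returns 0 or -1. */
+static int send_fds(int sock, const int *fds, int count)
+{
+    if (count < 1 || count > MAX_RING_FDS)
+        return -1;
+    char dummy = 0;
+    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
+    fd_ctrl_t ctrl;
+    memset(&ctrl, 0, sizeof ctrl);
+    struct msghdr msg;
+    memset(&msg, 0, sizeof msg);
+    msg.msg_iov = &iov;
+    msg.msg_iovlen = 1;
+    msg.msg_control = ctrl.buf;
+    msg.msg_controllen = CMSG_SPACE(sizeof(int) * (size_t)count);
+    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
+    c->cmsg_level = SOL_SOCKET;
+    c->cmsg_type = SCM_RIGHTS;
+    c->cmsg_len = CMSG_LEN(sizeof(int) * (size_t)count);
+    memcpy(CMSG_DATA(c), fds, sizeof(int) * (size_t)count);
+    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
+}
+
+/* Receive descriptors into `out` (capacity `max`). Returns the count or -1. */
+static int recv_fds(int sock, int *out, int max)
+{
+    char dummy;
+    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
+    fd_ctrl_t ctrl;
+    struct msghdr msg;
+    memset(&msg, 0, sizeof msg);
+    msg.msg_iov = &iov;
+    msg.msg_iovlen = 1;
+    msg.msg_control = ctrl.buf;
+    msg.msg_controllen = sizeof ctrl.buf;
+    if (recvmsg(sock, &msg, 0) <= 0 || (msg.msg_flags & MSG_CTRUNC))
+        return -1;
+    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
+        if (c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
+            continue;
+        int n = (int)((c->cmsg_len - CMSG_LEN(0)) / sizeof(int));
+        int tmp[MAX_RING_FDS];
+        memcpy(tmp, CMSG_DATA(c), sizeof(int) * (size_t)n);
+        if (n > max) {                 /* caller's array is too small */
+            for (int i = 0; i < n; ++i)
+                close(tmp[i]);         /* do not leak received FDs */
+            return -1;
+        }
+        memcpy(out, tmp, sizeof(int) * (size_t)n);
+        return n;
+    }
+    return -1;
+}
+```
+
+Each descriptor returned by `recv_fds` would then go through steps 2 to 4 (import, map, close).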
+
+### 3.8 SHUTDOWN payload
+
+```
+[4 bytes]  reason (LE u32)               0 = normal, 1 = error, 2 = upgrade
+[12 bytes] reserved
+```
+
+On SHUTDOWN the subscriber calls `cuMemUnmap` / `cuMemAddressFree` on all slots and returns `DISCONNECTED` to the user on the next call.
+
+### 3.9 ERROR payload
+
+```
+[4 bytes]  error_code (LE i32)           cuframes_error_t
+[4 bytes]  message_len (LE u32)
+[N bytes]  message (UTF-8)
+[12 bytes] reserved
+```
+
+## 4. Handshake sequence
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant C as Subscriber (consumer)
+    participant P as Publisher
+
+    C->>P: socket connect /run/cuframes/cam1.sock
+    C->>P: HELLO_REQ (proto_version=4, consumer_name, cuda_device, mode)
+
+    alt proto mismatch / format unsupported
+        P-->>C: HELLO_RESP (result=ERR_PROTOCOL or ERR_FORMAT)
+        Note over C,P: socket close, subscriber returns error
+    else handshake ok
+        P-->>C: HELLO_RESP (result=0, ring_size, frame_meta, shm_path)
+        C->>P: SUBSCRIBE_REQ (proto_version=4)
+
+        alt subscriber slot pool full
+            P-->>C: SUBSCRIBE_RESP (result=ERR_TOO_MANY)
+        else slot allocated
+            P-->>C: SUBSCRIBE_RESP (result=0, assigned_bit, initial_seq)
+            P-->>C: VMM_FDS (fd_count, frame_size, granularity) + SCM_RIGHTS [fd0..fdN]
+            Note over C: cuMemImportFromShareableHandle for each fd<br/>cuMemAddressReserve + Map + SetAccess
+            Note over C,P: READY — subscriber may poll global_seq
+        end
+    end
+```
+
+The legacy v1–v3 sequence has no `VMM_FDS` step: there the slot descriptor in SHM carried the `cudaIpcMemHandle_t` itself, and the subscriber called `cudaIpcOpenMemHandle` with no socket interaction. In v4 this step is **mandatory**; without `VMM_FDS` the subscriber gets no pointers and cannot read frames.
+
+## 5. Sync model (v4)
+
+In v0.4, cross-process synchronization is reduced to a stream sync on the producer side:
+
+```c
+// Publisher side (libcuframes/src/producer.c)
+cuStreamSynchronize(stream);  // ensure all writes coherent
+atomic_store_release(&slot.seq, next_seq);
+atomic_store_release(&hdr.global_seq, next_seq);
+
+// Consumer side (libcuframes/src/consumer.c)
+uint64_t cur = atomic_load_acquire(&hdr.global_seq);
+if (cur > my_last) {
+    // read slot metadata, do cudaMemcpyAsync(DtoD, consumer_stream)
+    // HW coherence on a single GPU: no event wait is needed
+}
+```
+
+Why this works:
+
+- VMM-shared memory on a single GPU is one physical HBM region. The producer's `cuStreamSynchronize` guarantees that the writes have landed in HBM, so consumer reads that follow the atomic release/acquire pair see valid data without a CUDA event roundtrip.
+- This **does not work cross-GPU**: multi-GPU needs `cuMemSetAccess` for both devices plus explicit events. v4 does not support cross-device VMM sharing (a CUDA driver limitation).
+- The producer-side cost of `cuStreamSynchronize` is ≈ 1 ms per frame at 25 fps: measurable but acceptable.
+
+More detail: [Synchronization & VMM stream](/docs/concepts/sync-vmm-stream).
+
+## 6. ACK protocol
+
+When publishing slot N, the publisher:
+
+1. Writes frame data into the acquired slot;
+2. Calls `cuStreamSynchronize(stream)` as the coherence barrier;
+3. Atomically sets `slot[N].seq = next_seq`, `slot[N].pts_ns = now`, `slot[N].ack_bitmap = 0`;
+4. Atomically stores `global_seq = next_seq` with RELEASE ordering.
+
+The subscriber:
+
+1. ACQUIRE-loads `global_seq` and proceeds if it has advanced;
+2. Processes the frame (CUDA kernel, DtoD copy, etc.) on `consumer_stream`;
+3. ACKs with `atomic_fetch_or(&slot.ack_bitmap, 1ULL << my_bit, RELEASE)`;
+4. Atomically stores `subscriber_slot.last_seen_seq = seq` and `last_ack_ns = now`.
+
+In STRICT_WAIT mode the publisher waits before the next publish until:
+
+- `ack_bitmap == subscriber_bitmap` (all active subscribers have ACKed), or
+- the `consumer_ack_timeout_ms` timeout expires, in which case the subscriber is marked dead and its bit is cleared.
+
+In DROP_OLDEST mode the publisher does not wait for ACKs; it simply overwrites the slot.
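+
+The ACK step and the STRICT_WAIT check can be sketched with C11 atomics. The helper names are hypothetical. One deliberate deviation is labeled in the code: the sketch tests `(acks & subs) == subs` instead of strict equality, on the assumption that a leftover bit from a subscriber that was just removed should not block the publisher.
+
+```c
+#include <stdatomic.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Subscriber side: mark the slot as consumed by subscriber `my_bit`. */
+static void slot_ack(_Atomic uint64_t *ack_bitmap, unsigned my_bit)
+{
+    atomic_fetch_or_explicit(ack_bitmap, 1ULL << my_bit, memory_order_release);
+}
+
+/* Publisher side (STRICT_WAIT): true once every active subscriber has ACKed.
+ * Assumption of this sketch: extra ACK bits outside `subs` are tolerated. */
+static bool slot_fully_acked(_Atomic uint64_t *ack_bitmap,
+                             _Atomic uint64_t *subscriber_bitmap)
+{
+    uint64_t subs = atomic_load_explicit(subscriber_bitmap, memory_order_acquire);
+    uint64_t acks = atomic_load_explicit(ack_bitmap, memory_order_acquire);
+    return (acks & subs) == subs;
+}
+```
+
+A STRICT_WAIT publisher would poll `slot_fully_acked` until it returns true or `consumer_ack_timeout_ms` expires.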
+
+## 7. Versioning rules
+
+### 7.1 Wire protocol version
+
+`proto_version` is a single integer, currently `4`. Breaking changes bump it.
+
+Handshake:
+
+- If `subscriber.proto_version != publisher.proto_version`, the publisher answers `HELLO_RESP(result=ERR_PROTOCOL)`. **v4 has no backward-compat layer** for v1–v3: the magic differs and so does the mem-handle scheme.
+
+### 7.2 Library version (semver)
+
+`lib_version_major.minor.patch` is informational and is not used for compat decisions. It is stored in the SHM header for diagnostics.
+
+### 7.3 Reserved fields
+
+All `reserved_*` fields must be `0` on write, and readers ignore them. This lets minor releases add fields in reserved space without breaking the ABI.
+
+## 8. Conformance fixture
+
+Test skeleton (Phase 1, in `tests/conformance/`):
+
+```cpp
+TEST(ProtocolLayout, ShmHeaderMagic) {
+    EXPECT_EQ(CUFRAMES_PROTOCOL_MAGIC_V4, 0xCC7C1DCE);
+    EXPECT_EQ(cuframes_protocol_version(), 4);
+}
+
+TEST(ProtocolLayout, ShmHeaderOffsets) {
+    EXPECT_EQ(offsetof(cuframes_shared_header_t, magic), 0x0000);
+    EXPECT_EQ(offsetof(cuframes_shared_header_t, proto_version), 0x0004);
+    EXPECT_EQ(offsetof(cuframes_shared_header_t, meta), 0x0040);
+    EXPECT_EQ(offsetof(cuframes_shared_header_t, global_seq), 0x00C0);
+}
+
+TEST(ProtocolLayout, SlotDescriptorSize) {
+    EXPECT_EQ(sizeof(cuframes_slot_descriptor_t), 192);
+}
+
+TEST(ProtocolLayout, SubscriberSlotSize) {
+    EXPECT_EQ(sizeof(cuframes_subscriber_slot_t), 128);
+}
+
+TEST(Handshake, VmmFdsDelivered) {
+    // Setup publisher with ring_size=4
+    // Connect subscriber, complete HELLO+SUBSCRIBE
+    // Expect: VMM_FDS message with fd_count=4 + 4 valid FDs in SCM_RIGHTS
+}
+```
+
+## 9. Reference implementation
+
+`libcuframes/src/` in the repo:
+
+- `producer.c` — VMM allocation, FD send, atomic publish
+- `consumer.c` — FD receive, VMM import, atomic read
+- `protocol.c` — TLV framing, `send_msg_with_fds` / `recv_msg_with_fds` helpers
+
+Any other implementation (Python ctypes, Rust bindings, FFmpeg plugin) must be conformance-tested against this document.
+
+## 10. Packet ring (proto_version remains 4)
+
+The packet ring is a separate SHM segment, `/dev/shm/cuframes-<key>-packets`, with its own magic `0xCC7C1DCD`. It **does not use CUDA** (encoded data lives on the CPU, with no GPU sync). The v0.4 VMM changes **did not affect** the packet ring: wire format and layout match the v0.2 specification.
+
+### 10.1 Compatibility with the frame ring
+
+- A subscriber can request frames only, packets only, or both (see §10.9, subscribe flags).
+- The packet ring is optional: if the publisher never called `enable_packets`, the subscriber's `enable_packets` returns `NOT_FOUND`.
+
+### 10.2 Packet SHM layout
+
+Size: `sizeof(packet_ring_header_t) + N×PSE + DATA_SIZE`, where:
+
+- N = `ring_slots`, default 64 (1..1024)
+- PSE = `sizeof(packet_slot_entry_t)` = 64 bytes (§10.4)
+- DATA_SIZE = `data_size`, default 8 MiB
+
+#### Header byte layout
+
+```
+Offset                  Size   Field                       Comments
+─────────────────────── ────── ──────────────────────────  ────────────────────────────
+0x0000                       4 magic (LE u32)              0xCC7C1DCD  (frames magic − 1)
+0x0004                       4 proto_version (LE u32)      4 (same as frames in v0.4)
+0x0008                       4 ring_slots (LE u32)         N (1..1024)
+0x000C                       4 data_size (LE u32)          bytes for packet data ring
+0x0010                       4 codec_id (LE u32)           AV_CODEC_ID_* enum
+0x0014                       4 codec_extradata_size        ≤ 4096
+0x0018                       8 producer_pid (LE u64)
+0x0020                       8 global_seq (LE u64, atomic) monotonic over packets
+0x0028                       8 last_keyframe_seq (atomic)  for late subscribers
+0x0030                       8 write_offset (LE u64, atom) current cursor in the data ring
+0x0038                       8 shutdown_flag (atomic)
+0x0040                    4096 codec_extradata             SPS/PPS/VPS bytes
+0x1040                    N×64 slots[N]                    packet_slot_entry_t
+0x1040+N×64           DATA_SIZE data[]                     wraparound byte buffer
+```
+
+### 10.3 Codec extradata
+
+H.264: SPS + PPS, concatenated in **Annex B** format (start codes `00 00 00 01`). H.265: VPS + SPS + PPS.
+
+`codec_id` matches FFmpeg's `AV_CODEC_ID_H264` (27), `AV_CODEC_ID_HEVC` (173), `AV_CODEC_ID_AV1`. The subscriber writes the extradata into its decoder's `AVCodecContext.extradata`, or into `AVStream.codecpar->extradata` for muxers.
+
+The publisher sets extradata **once**, at the first keyframe (or from the RTSP SDP). After that it is fixed for the publisher's lifetime (a mid-stream codec change requires destroy + recreate with a new `<key>`).
+
+### 10.4 Packet slot entry (64 bytes)
+
+```
+Offset  Size   Field                       Comments
+0x00       8   seq (LE u64, atomic)        published seq; UINT64_MAX = invalid
+0x08       8   pts_ns (LE i64)
+0x10       8   dts_ns (LE i64)             for B-frames
+0x18       8   data_offset (LE u64)        absolute byte cursor into data[]
+0x20       4   data_size (LE u32)          payload size
+0x24       4   flags (LE u32)              see §10.5
+0x28      24   reserved                    0
+```
+
+`data_offset` is an absolute byte cursor (it may grow without bound); the actual byte index is `data_offset % hdr.data_size`, where `hdr.data_size` is the ring size from the header, not the slot's payload size. A subscriber may need a split read when the payload wraps around.
+
+### 10.5 Packet flags
+
+| Bit | Name | Comments |
+|---|---|---|
+| 0 | `KEY` | keyframe (IDR for H.264, CRA/IDR for HEVC). **Critical** for late subscribers. |
+| 1 | `CORRUPT` | the publisher detected a damaged packet (RTP loss); the subscriber may skip it |
+| 2 | `DISCONTINUITY` | there was a gap before this packet (publisher reconnected to the camera) |
+| 3 | `LAST_IN_AU` | last NAL in the access unit, for muxers that wait for a complete frame |
+| 4–31 | reserved | 0 |
+
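+Mapping to `AVPacket.flags`:
+
+- bit 0 → `AV_PKT_FLAG_KEY`
+- bit 1 → `AV_PKT_FLAG_CORRUPT`
+- bit 2 has no direct `AVPacket.flags` counterpart: FFmpeg's `packet.h` defines no `AV_PKT_FLAG_DISCONTINUITY`, and the value `0x0004` there is `AV_PKT_FLAG_DISCARD`, so the consumer must handle the discontinuity itself instead of copying the bit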
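+
+A flag translation along these lines is sketched below. The `SKETCH_*` constants and `to_avpacket_flags` are hypothetical names for this illustration; the numeric values of `AV_PKT_FLAG_KEY` (`0x0001`) and `AV_PKT_FLAG_CORRUPT` (`0x0002`) are copied in so the block compiles without FFmpeg headers. The sketch assumes the discontinuity bit is reported separately instead of being written into `AVPacket.flags`.
+
+```c
+#include <stdint.h>
+
+/* Wire flags from the table above. */
+#define PKT_FLAG_KEY           (1u << 0)
+#define PKT_FLAG_CORRUPT       (1u << 1)
+#define PKT_FLAG_DISCONTINUITY (1u << 2)
+
+/* Local copies of AV_PKT_FLAG_KEY / AV_PKT_FLAG_CORRUPT values. */
+#define SKETCH_AV_PKT_FLAG_KEY     0x0001
+#define SKETCH_AV_PKT_FLAG_CORRUPT 0x0002
+
+/* Returns the AVPacket.flags value; reports discontinuity through the out-parameter. */
+static int to_avpacket_flags(uint32_t wire_flags, int *discontinuity)
+{
+    int av = 0;
+    if (wire_flags & PKT_FLAG_KEY)
+        av |= SKETCH_AV_PKT_FLAG_KEY;
+    if (wire_flags & PKT_FLAG_CORRUPT)
+        av |= SKETCH_AV_PKT_FLAG_CORRUPT;
+    *discontinuity = (wire_flags & PKT_FLAG_DISCONTINUITY) != 0;
+    return av;
+}
+```
+
+A muxer-side consumer could use the `discontinuity` result to reset its timestamp tracking before writing the packet.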
+
+### 10.6 Atomic publish (pseudo-C)
+
+```c
+uint64_t seq = atomic_load(&hdr->global_seq, RELAXED) + 1;
+uint64_t off = atomic_load(&hdr->write_offset, RELAXED);
+
+size_t slot_idx = seq % hdr->ring_slots;
+packet_slot_entry_t *slot = &slots[slot_idx];
+
+// Wraparound write
+size_t off_in_ring = off % hdr->data_size;
+size_t first = min(size, hdr->data_size - off_in_ring);
+memcpy(data + off_in_ring, payload, first);
+if (first < size)
+    memcpy(data, payload + first, size - first);
+
+slot->pts_ns = pts;
+slot->dts_ns = dts;
+slot->data_offset = off;
+slot->data_size = size;
+slot->flags = flags;
+atomic_store(&slot->seq, seq, RELEASE);
+
+atomic_store(&hdr->write_offset, off + size, RELEASE);
+atomic_store(&hdr->global_seq, seq, RELEASE);
+
+if (flags & PKT_FLAG_KEY)
+    atomic_store(&hdr->last_keyframe_seq, seq, RELEASE);
+```
+
+### 10.7 Atomic read (pseudo-C)
+
+```c
+uint64_t cur = atomic_load(&hdr->global_seq, ACQUIRE);
+if (cur <= my_last_seq) return TIMEOUT;
+
+uint64_t want_seq = my_last_seq + 1;
+size_t slot_idx = want_seq % hdr->ring_slots;
+packet_slot_entry_t *slot = &slots[slot_idx];
+
+uint64_t slot_seq = atomic_load(&slot->seq, ACQUIRE);
+if (slot_seq != want_seq) {
+    // Overrun: re-anchor on the last keyframe so the next call starts there
+    my_last_seq = atomic_load(&hdr->last_keyframe_seq, ACQUIRE) - 1;
+    return PACKET_OVERRUN;
+}
+
+// Copy payload (wraparound aware)
+uint64_t off = slot->data_offset % hdr->data_size;
+uint32_t size = slot->data_size;
+uint32_t first = min(size, hdr->data_size - off);
+memcpy(out, data + off, first);
+if (first < size)
+    memcpy(out + first, data, size - first);
+
+// Re-check (seqlock variant): guards against an overrun mid-read
+if (atomic_load(&slot->seq, ACQUIRE) != want_seq)
+    return PACKET_OVERRUN;
+
+my_last_seq = want_seq;
+return OK;
+```
+
+### 10.8 Late subscriber → keyframe-aligned start
+
+On `cuframes_subscriber_enable_packets()` the subscriber sets `my_last_seq = last_keyframe_seq - 1` (reading `last_keyframe_seq` from the header). The first `next_packet` then returns a keyframe, so the decoder can start without glitches.
+
+**Risk:** if `last_keyframe_seq` has already left the ring when `enable_packets` runs (slow subscriber start, or a GOP longer than `ring_slots` packets), the subscriber detects an overrun on the first read and the library moves on to the next keyframe.
+
+Sizing guide: keep `packet_ring_slots × avg_packet_size > GOP_size_in_bytes` for normal operation.
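+
+The sizing rule is plain arithmetic, sketched below. Both helper names are hypothetical, and the GOP estimate assumes a roughly constant bitrate, which real camera streams only approximate.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Approximate bytes in one GOP of a constant-bitrate stream. */
+static uint64_t gop_size_bytes(uint64_t bitrate_bps, uint32_t gop_frames, uint32_t fps)
+{
+    return bitrate_bps / 8 * gop_frames / fps;
+}
+
+/* The rule from the text: slots × average packet size must exceed one GOP. */
+static bool ring_covers_gop(uint32_t ring_slots, uint64_t avg_packet_bytes,
+                            uint64_t gop_bytes)
+{
+    return (uint64_t)ring_slots * avg_packet_bytes > gop_bytes;
+}
+```
+
+As an illustrative case, a 4 Mbit/s stream at 25 fps with a 50-frame GOP gives about 1,000,000 bytes per GOP. With one packet per frame the average packet is about 20,000 bytes, so 64 slots cover 1,280,000 bytes and satisfy the rule with a modest margin.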
+
+### 10.9 Socket protocol extensions for packets
+
+#### HELLO_REQ: subscribe flags in the reserved field
+
+v4 interprets the first 4 bytes of `reserved` (payload offset `16 + consumer_name_len`, because the field follows the variable-length name) as `subscribe_flags`:
+
+| Bit | Name | Comments |
+|---|---|---|
+| 0 | `WANTS_FRAMES` | subscribe to the frame ring (implicit default) |
+| 1 | `WANTS_PACKETS` | subscribe to the packet ring |
+| 2–31 | reserved | 0 |
+
+If the subscriber leaves flags=0, the publisher interprets it as `WANTS_FRAMES=true, WANTS_PACKETS=false`.
+
+#### HELLO_RESP: packet-ring fields in the reserved section
+
+```
+[4 bytes]  packet_shm_path_len (LE u32)  0 = packets disabled at publisher
+[N bytes]  packet_shm_path (UTF-8)       e.g. "cuframes-cam1-packets" (relative to /dev/shm/)
+[4 bytes]  codec_id (LE u32)             AV_CODEC_ID_*
+[8 bytes]  initial_packet_seq (LE u64)   last_keyframe_seq at handshake time
+```
+
+If the subscriber requested `WANTS_PACKETS=1` but the publisher has no packet ring, the reply carries `result = ERR_NOT_FOUND`.
+
+## 11. Open for v0.5+
+
+These decisions must **not** break v4 compatibility:
+
+- **FD wakeup channel**: bring back the `EVENT_FD` message plus an eventfd to replace polling (latency win);
+- **Multi-codec in one publisher**: a separate slot for thumbnail meta;
+- **AMD/ROCm HIP IPC**: would replace `cuMemCreate` with the HIP virtual memory management equivalent (`hipMemCreate`);
+- **Cross-host** over RDMA: a separate transport with a new proto_version.
+
+Any of these means bumping `proto_version` to 5 and writing a separate document.
+
+## See also
+
+- [C API reference](/docs/reference/api-c): a wrapper over the protocol.
+- [Synchronization & VMM stream](/docs/concepts/sync-vmm-stream): why `cuStreamSynchronize`.
+- [Frame ring vs Packet ring](/docs/concepts/frame-vs-packet-ring): when to use which channel.
+- [Install](/docs/getting-started/install): runtime requirements.
@@ -0,0 +1,146 @@
+import { themes as prismThemes } from 'prism-react-renderer';
+import type { Config } from '@docusaurus/types';
+import type * as Preset from '@docusaurus/preset-classic';
+
+const config: Config = {
+  title: 'cuframes',
+  tagline: 'Zero-copy decoded video frames over CUDA — across processes, without pid sharing.',
+  favicon: 'img/favicon.ico',
+
+  future: {
+    v4: true,
+  },
+
+  url: 'https://cuframes.dev',
+  baseUrl: '/',
+
+  organizationName: 'gx',
+  projectName: 'cuframes-docs',
+
+  onBrokenLinks: 'throw',
+
+  i18n: {
+    defaultLocale: 'en',
+    locales: ['en', 'ru'],
+    localeConfigs: {
+      en: { label: 'English', direction: 'ltr', htmlLang: 'en-US' },
+      ru: { label: 'Русский', direction: 'ltr', htmlLang: 'ru-RU', path: 'ru', translate: true },
+    },
+  },
+
+  presets: [
+    [
+      'classic',
+      {
+        docs: {
+          sidebarPath: './sidebars.ts',
+          editUrl:
+            'https://git.goldix.org/gx/cuframes-docs/_edit/main/site/',
+          showLastUpdateTime: true,
+          showLastUpdateAuthor: false,
+        },
+        blog: false,
+        theme: {
+          customCss: './src/css/custom.css',
+        },
+      } satisfies Preset.Options,
+    ],
+  ],
+
+  themes: [
+    '@docusaurus/theme-mermaid',
+    [
+      require.resolve('@easyops-cn/docusaurus-search-local'),
+      {
+        hashed: true,
+        language: ['en', 'ru'],
+        indexDocs: true,
+        indexBlog: false,
+        indexPages: false,
+        docsRouteBasePath: '/docs',
+        highlightSearchTermsOnTargetPage: true,
+        explicitSearchResultPath: true,
+      },
+    ],
+  ],
+
+  markdown: {
+    mermaid: true,
+    hooks: {
+      onBrokenMarkdownLinks: 'warn',
+    },
+  },
+
+  themeConfig: {
+    // OpenGraph social card: add later (1200×630).
+
+    navbar: {
+      title: 'cuframes',
+      // Logo TBD: no logo block for now (Docusaurus shows only the title).
+      items: [
+        {
+          type: 'docSidebar',
+          sidebarId: 'mainSidebar',
+          position: 'left',
+          label: 'Documentation',
+        },
+        {
+          href: 'https://git.goldix.org/gx/cuframes',
+          label: 'Source',
+          position: 'right',
+        },
+        {
+          type: 'localeDropdown',
+          position: 'right',
+        },
+      ],
+    },
+
+    footer: {
+      style: 'dark',
+      links: [
+        {
+          title: 'Documentation',
+          items: [
+            { label: 'What is cuframes', to: '/docs/intro' },
+            { label: 'Getting started', to: '/docs/getting-started/install' },
+            { label: 'FFmpeg integration', to: '/docs/integration/ffmpeg-demuxer' },
+            { label: 'API reference', to: '/docs/reference/api-c' },
+          ],
+        },
+        {
+          title: 'Project',
+          items: [
+            { label: 'Source (Gitea)', href: 'https://git.goldix.org/gx/cuframes' },
+            { label: 'Issues', href: 'https://git.goldix.org/gx/cuframes/issues' },
+            { label: 'Releases', href: 'https://git.goldix.org/gx/cuframes/releases' },
+            { label: 'License (LGPL-2.1+)', href: 'https://git.goldix.org/gx/cuframes/raw/branch/main/LICENSE' },
+          ],
+        },
+        {
+          title: 'Related',
+          items: [
+            { label: 'NVIDIA CUDA VMM API', href: 'https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html' },
+            { label: 'FFmpeg', href: 'https://ffmpeg.org/' },
+            { label: 'Frigate (NVR consumer example)', href: 'https://frigate.video/' },
+          ],
+        },
+      ],
+      copyright: `© ${new Date().getFullYear()} cuframes contributors. Licensed under LGPL-2.1+.`,
+    },
+
+    prism: {
+      theme: prismThemes.github,
+      darkTheme: prismThemes.dracula,
+      additionalLanguages: ['c', 'cpp', 'python', 'bash', 'docker', 'yaml'],
+    },
+
+    colorMode: {
+      defaultMode: 'light',
+      disableSwitch: false,
+      respectPrefersColorScheme: true,
+    },
+  } satisfies Preset.ThemeConfig,
+};
+
+export default config;
@@ -0,0 +1,7 @@
+{
+  "label": "Концепции",
+  "position": 3,
+  "collapsible": true,
+  "collapsed": false,
+  "link": null
+}
@@ -0,0 +1,118 @@
+---
+title: Frame ring vs Packet ring
+sidebar_position: 1
+---
+
+# Frame ring vs Packet ring
+
+cuframes даёт publisher'у два **независимых** ring buffer'а с разной семантикой и разной стоимостью:
+
+- **Frame ring** — decoded NV12 (или другой pixel format), shared zero-copy через CUDA VMM.
+- **Packet ring** — encoded H.264 / H.265 NAL units, shared через POSIX shared memory.
+
+Это два разных канала на одной паре publisher↔subscriber. Можно использовать один, оба, или иметь несколько consumer'ов где одни читают frame ring, другие — packet ring.
+
+## Зачем два канала
+
+Один publisher обычно обслуживает разные классы потребителей. AI-детектор и GPU-композитор хотят уже decoded GPU pointer (frame ring); NVR-recorder и replay-сервис хотят compact encoded stream (packet ring). Заставлять recorder декодировать ради того чтобы потом снова encode'нуть для сохранения — пустая работа.
+
+```mermaid
+flowchart LR
+  RTSP[RTSP camera] --> Dec[NVDEC decode]
+  Dec --> Pub[Publisher]
+  Pub -- frame ring<br/>CUDA VMM --> AI[AI inference<br/>GPU consumer]
+  Pub -- frame ring<br/>CUDA VMM --> Comp[GPU compositor<br/>CUDA filter]
+  Pub -- packet ring<br/>POSIX shm --> NVR[NVR / recorder<br/>mp4 mux]
+  Pub -- packet ring<br/>POSIX shm --> Replay[Replay / seek service]
+```
+
+## Frame ring
+
+**Что:** ring из N CUDA-allocated slot'ов (`cuMemCreate(POSIX_FILE_DESCRIPTOR)`), экспортируются через `SCM_RIGHTS` consumer'у, импортируются через `cuMemImportFromShareableHandle`. Consumer получает CUDA device pointer на ту же физическую HBM-память что и publisher.
+
+**Когда использовать:**
+
+- consumer работает на GPU и хочет данные as-is (AI inference, CUDA filter, NVENC re-encode на другом codec);
+- латентность критична — между publish и consume hardware coherence, без encode/decode roundtrip;
+- consumer декодировать сам не хочет.
+
+**Стоимость:** `ring_size × frame_size` GPU-памяти на publisher. Для NV12 1920×1080 ring=4 это ≈ 12 MiB на publisher (VMM granularity на RTX 5090 — 2 MiB, реально ≈ 16 MiB). Consumer'ы memory не платят — это та же физическая память.
+
+API:
+
+```c
+cuframes_publisher_create(&cfg, &pub);            // ring аллоцируется
+cuframes_publisher_acquire(pub, &cuda_ptr);       // получаем slot
+// ... NVDEC / kernel пишут в cuda_ptr ...
+cuframes_publisher_publish(pub, stream, pts_ns);
+```
+
+## Packet ring
+
+**Что:** ring из slot'ов с metadata (pts, dts, size, flags) + отдельная data-секция (default 8 MiB) в POSIX shm `/dev/shm/cuframes-<key>-packets`. Publisher закидывает туда encoded NAL units (Annex B byte stream).
+
+**Когда использовать:**
+
+- consumer декодирует сам (FFmpeg demuxer, recorder, на остальном GPU нет места);
+- нужен compact stream для записи на диск;
+- late subscriber должен сам resync'нуться от ближайшего keyframe — это семантика ring'а (см. [protocol reference](/docs/reference/protocol)).
+
+**Стоимость:** POSIX shm на host — `data_size + ring_slots × 64 байта`. На GPU расход нулевой.
+
+Packet ring **опционален и отдельно активируется** на уже созданном publisher'е:
+
+```c
+cuframes_publisher_create(&cfg, &pub);
+
+cuframes_packet_ring_options_t pkt_opts = {
+    .ring_slots      = 64,
+    .data_size       = 8 * 1024 * 1024,
+    .max_packet_size = 2 * 1024 * 1024,
+    .codec_id        = 27,   // AV_CODEC_ID_H264
+};
+cuframes_publisher_enable_packets(pub, &pkt_opts);
+cuframes_publisher_set_codec_extradata(pub, sps_pps, sps_pps_size);
+
+// в цикле:
+cuframes_publisher_publish_packet(pub, nal_data, nal_size,
+                                  pts_ns, dts_ns,
+                                  CUFRAMES_PKT_FLAG_KEY);
+```
+
+Аналогично на subscriber'е:
+
+```c
+cuframes_subscriber_create(&cfg, &sub);
+cuframes_subscriber_enable_packets(sub);   // открывает второй SHM
+
+cuframes_packet_t *pkt;
+cuframes_subscriber_next_packet(sub, &pkt, -1);
+```
+
+Subscriber может включить любую комбинацию: только frame ring, только packet ring, оба сразу. Это два независимых SHM segment'а с разными magic.
+
+## Сравнение
+
+| | Frame ring | Packet ring |
+|---|---|---|
+| Содержимое | Decoded NV12 / RGB / etc | Encoded H.264 / H.265 NAL |
+| Транспорт | CUDA VMM + POSIX FD | POSIX shm |
+| Sync mechanism | atomic seq + `cuStreamSynchronize` | atomic seq (нет CUDA) |
+| Latency publish→consume | sub-frame, без encode roundtrip | sub-frame, но consumer декодирует |
+| Memory cost (publisher) | `ring_size × frame_size` GPU | `data_size` host shm |
+| Memory cost (consumer) | 0 (shared physical pages) | 0 (mmap same shm) |
+| Требует CUDA на consumer | да | нет |
+| Late join semantics | newest frame документирован | resync на last keyframe |
+| Типичный use case | AI inference, GPU compositor | NVR recording, replay |
+
+## Можно ли один без другого
+
+Да. Frame ring аллоцируется в `cuframes_publisher_create` — без него publisher вообще не существует. Packet ring опционален: если `cuframes_publisher_enable_packets` не вызвать, publisher просто не примет `publish_packet`, а subscriber на `enable_packets` получит `CUFRAMES_ERR_NOT_FOUND`.
+
+Обратное (packet ring без frame ring) в текущем API не поддерживается — для pure encoded-only сценариев это TODO будущей версии.
+
+## Следующее
+
+- [Ownership modes](/docs/concepts/ownership-modes) — как выбрать кто аллоцирует ring.
+- [Synchronization](/docs/concepts/sync-vmm-stream) — почему frame ring sync через stream sync, а не через CUDA events.
+- [Первый publisher](/docs/getting-started/first-publisher) — рабочий C-пример без packet ring.
@@ -0,0 +1,119 @@
+---
+title: Ownership modes
+sidebar_position: 2
+---
+
+# Ownership modes
+
+Кто владеет CUDA-памятью в которую publisher пишет frame'ы — сама library или внешний код. В заголовке `cuframes.h` объявлены оба варианта:
+
+```c
+typedef enum cuframes_ownership_mode {
+    CUFRAMES_OWNERSHIP_LIBRARY  = 0,
+    CUFRAMES_OWNERSHIP_EXTERNAL = 1,
+} cuframes_ownership_mode_t;
+```
+
+Но в v0.4 **работает только `LIBRARY`**. `EXTERNAL` оставлен в API для бинарной совместимости и помечен deprecated. Ниже — почему и что с этим делать если твой код раньше использовал EXTERNAL.
+
+## LIBRARY mode (единственный рабочий в v0.4)
+
+Publisher просит library аллоцировать ring заданного размера. Каждый кадр publisher получает чистый slot, пишет в него, отдаёт обратно через publish.
+
+```c
+cuframes_publisher_config_t cfg = {
+    .key       = "cam1",
+    .width     = 1920,
+    .height    = 1080,
+    .format    = CUFRAMES_FORMAT_NV12,
+    .ownership = CUFRAMES_OWNERSHIP_LIBRARY,
+    .ring_size = 4,
+    .policy    = CUFRAMES_POLICY_DROP_OLDEST,
+};
+cuframes_publisher_t *pub;
+cuframes_publisher_create(&cfg, &pub);
+
+void *slot;
+cuframes_publisher_acquire(pub, &slot);
+// NVDEC / cuMemcpy / kernel пишет в slot
+cuframes_publisher_publish(pub, stream, cuframes_now_ns());
+```
+
+Under the hood the library:
+
+1. Allocates `ring_size` slots via `cuMemCreate` with `CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR`.
+2. Calls `cuMemMap` + `cuMemSetAccess` so the local publisher can write.
+3. On subscribe, passes the POSIX FDs via `sendmsg(SCM_RIGHTS)`.
+4. The subscriber calls `cuMemImportFromShareableHandle` and gets a pointer to the same physical HBM.
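+
+Step 3 is plain POSIX and can be tried without a GPU. The sketch below is not taken from the cuframes source: `send_fd` and `recv_fd` are hypothetical helper names, and it passes a single descriptor over an already connected Unix socket, whereas the library hands over one FD per slot.
+
+```c
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/uio.h>
+#include <unistd.h>
+
+/* Send one file descriptor over a connected Unix socket. Returns 0 on success. */
+static int send_fd(int sock, int fd)
+{
+    char byte = 'F';   /* at least one byte of ordinary data must accompany the FD */
+    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
+    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
+    memset(&ctl, 0, sizeof ctl);
+
+    struct msghdr msg = {0};
+    msg.msg_iov = &iov;
+    msg.msg_iovlen = 1;
+    msg.msg_control = ctl.buf;
+    msg.msg_controllen = sizeof ctl.buf;
+
+    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
+    c->cmsg_level = SOL_SOCKET;
+    c->cmsg_type = SCM_RIGHTS;
+    c->cmsg_len = CMSG_LEN(sizeof(int));
+    memcpy(CMSG_DATA(c), &fd, sizeof(int));
+
+    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
+}
+
+/* Receive one file descriptor. Returns the new FD, or -1 on error. */
+static int recv_fd(int sock)
+{
+    char byte;
+    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
+    union { struct cmsghdr align; char buf[CMSG_SPACE(sizeof(int))]; } ctl;
+    memset(&ctl, 0, sizeof ctl);
+
+    struct msghdr msg = {0};
+    msg.msg_iov = &iov;
+    msg.msg_iovlen = 1;
+    msg.msg_control = ctl.buf;
+    msg.msg_controllen = sizeof ctl.buf;
+
+    if (recvmsg(sock, &msg, 0) != 1)
+        return -1;
+    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
+    if (c == NULL || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
+        return -1;
+
+    int fd;
+    memcpy(&fd, CMSG_DATA(c), sizeof(int));
+    return fd;
+}
+```
+
+The received descriptor is a new number in the receiving process that refers to the same open file description, which is why this works across PID namespaces as long as both sides can reach the socket.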
+
+This path is zero-copy for consumers. The memory overhead on the publisher is `ring_size × frame_size`; on the consumer it is zero.
+
+## EXTERNAL mode (deprecated in v0.4)
+
+The idea was that the publisher already holds CUDA pointers from someone else's pool (FFmpeg `AVHWFramesContext`, NVDEC output, DeepStream, a custom decoder) and wants to share them as-is, with no extra allocation and no D2D copy.
+
+```c
+// This worked in v0.1–0.3. It does not work in v0.4.
+cuframes_publisher_create_external(&cfg, ffmpeg_pool_ptrs,
+                                   pool_size, frame_size, &pub);
+```
+
+In v0.4 this path is broken by design.
+
+### Why EXTERNAL does not work with VMM
+
+cuframes v0.4 publishes frames through POSIX FDs, because that is the only CUDA IPC channel that **does not require a shared PID namespace** (see the [v0.4 note](/docs/intro) and the [sync writeup](/docs/concepts/sync-vmm-stream)). The FD comes from `cuMemExportToShareableHandle`, and that function **requires** the memory to have been allocated via `cuMemCreate` with the matching `requestedHandleTypes`.
+
+An existing `cudaMalloc` / `cudaMallocPitch` pointer (the kind FFmpeg or DeepStream hands out) is not VMM memory, so there is nothing to export it as a POSIX FD with. The old v0.3 path used `cudaIpcGetMemHandle` (an opaque 64-byte structure sent in the socket payload); it worked with any memory but required a shared PID namespace. v0.4 dropped it deliberately.
+
+### What to do instead
+
+If you already have a GPU pool from FFmpeg / NVDEC / etc., switch to LIBRARY mode with one extra device-to-device copy:
+
+```c
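+// FFmpeg hands out the frame from its hwframe pool:
+AVFrame *src = ...;            // src->data[0] was cudaMalloc'd by FFmpeg
+
+void *slot;
+cuframes_publisher_acquire(pub, &slot);
+
+// One pitched DtoD copy. Covering Y and UV in a single call assumes the UV
+// plane directly follows the Y plane on both sides: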
+cuMemcpy2DAsync(&(CUDA_MEMCPY2D){
+    .srcMemoryType = CU_MEMORYTYPE_DEVICE,
+    .srcDevice = (CUdeviceptr)src->data[0],
+    .srcPitch  = src->linesize[0],
+    .dstMemoryType = CU_MEMORYTYPE_DEVICE,
+    .dstDevice = (CUdeviceptr)slot,
+    .dstPitch  = pub_pitch,
+    .WidthInBytes = width,
+    .Height       = height_y + height_uv,
+}, stream);
+
+cuframes_publisher_publish(pub, stream, pts_ns);
+```
+
+This is how the `cuframes-rtsp-source` tool in the cuframes repository was ported: it used to hand its FFmpeg pool over through EXTERNAL, and now does acquire + one D2D copy. The overhead is small: a single DtoD copy of a 1920×1080 NV12 frame takes tens of microseconds, which is noise next to `cuStreamSynchronize`.
+
+### Memory trade-off
+
+| | LIBRARY (v0.4) | EXTERNAL (v0.3, deprecated) |
+|---|---|---|
+| Extra allocation on the publisher | `ring_size × frame_size` | 0 |
+| D2D copies per frame | 1 (with an upstream pool) or 0 (decoder writes straight into the slot) | 0 |
+| Zero-copy for consumers | yes | yes |
+| Works without a shared PID namespace | yes | no |
+| Supported in v0.4 | yes | **no** |
+
+If the decoder can be taught to write straight into the slot (`acquire` first, then decode into the returned pointer), the extra D2D copy disappears. This is what `cuframes-rtsp-source` does with its NVDEC pipeline.
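+
+To put a number on the `ring_size × frame_size` row, here is a back-of-the-envelope helper. It is only an estimate under stated assumptions: frames are tightly packed NV12 with even dimensions, and pitch padding and VMM allocation-granularity rounding (which the real library has to apply) are ignored. The function names are illustrative, not part of the cuframes API.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Bytes in one tightly packed NV12 frame: full-size Y plane plus a
+ * half-height interleaved UV plane. Assumes even width and height. */
+static size_t nv12_frame_bytes(uint32_t width, uint32_t height)
+{
+    return (size_t)width * height * 3 / 2;
+}
+
+/* Publisher-side ring footprint, ignoring pitch padding and
+ * VMM granularity rounding. */
+static size_t ring_bytes(uint32_t width, uint32_t height, uint32_t ring_size)
+{
+    return nv12_frame_bytes(width, height) * ring_size;
+}
+```
+
+For the 1920×1080, `ring_size = 4` configuration used on this page that comes to 3,110,400 bytes per frame and 12,441,600 bytes (about 11.9 MiB) for the ring.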
+
+## Will EXTERNAL come back
+
+If NVIDIA adds a way to export `cudaMalloc` memory as a POSIX FD, yes: that would restore the zero-D2D path without giving up cross-namespace operation. As of CUDA 12.4 there is no such API, and NVIDIA has not announced one. In practice, do not count on it.
+
+The `ownership` field stays in `cuframes_publisher_config_t` for ABI stability. Passing `CUFRAMES_OWNERSHIP_EXTERNAL` in v0.4 returns `CUFRAMES_ERR_INVALID_ARG`. `cuframes_publisher_create_external` is still declared in the header but returns the same error.
+
+## Next
+
+- [Synchronization](/docs/concepts/sync-vmm-stream): why v0.4 moved away from CUDA events, and how that ties to the same VMM limitation.
+- [First publisher](/docs/getting-started/first-publisher): a working LIBRARY-mode example.
+- [Protocol reference](/docs/reference/protocol): wire format of the VMM_FDS handshake.
@@ -0,0 +1,121 @@
+---
+title: "Synchronization: stream sync, not CUDA events"
+sidebar_position: 3
+---
+
+# Synchronization: stream sync, not CUDA events
+
+A producer and a consumer in different processes need a mechanism that guarantees that, by the time the consumer starts reading a slot, **all of the producer's GPU writes to that slot have landed in HBM**. Before v0.4 this was done with CUDA IPC events. Since v0.4 it is `cuStreamSynchronize` plus atomic ordering. The change is not cosmetic, and this page explains why.
+
+## What v0.3 did: CUDA IPC events
+
+On every publish the producer called `cudaEventRecord` on its stream. The event handle (`cudaIpcEventHandle_t`) was exported once at startup and shared with all subscribers. For every frame the subscriber called `cudaStreamWaitEvent` on its own stream, so the GPU scheduler waited for the producer's record to complete before letting the DtoD copy run.
+
+Advantage: the CPU never blocks. The producer enqueues work and moves on; the waiting happens in the GPU command queue.
+
+Drawback: **CUDA IPC events require a shared PID namespace between the processes**, just like `cudaIpcOpenMemHandle` did. The NVIDIA Driver API exports event handles only through the same legacy IPC mechanism, which has no POSIX FD counterpart. `CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR` exists for memory but not for events.
+
+That drawback is what broke us on Frigate. Frigate under s6-overlay does not provide a shared PID namespace, and `cudaIpcOpenEventHandle` without one fails silently: subscribe hangs until the first-frame timeout. See the [note on PID sharing](/docs/intro).
+
+## What v0.4 does instead
+
+Before publishing a seq, the producer calls **`cuStreamSynchronize`** on the stream the GPU data was written on. This blocks the CPU: the function returns only once all pending writes on that stream have completed. Only then comes the atomic store into the shared header.
+
+From `producer.c::do_publish` (v0.4):
+
+```c
+static int do_publish(cuframes_publisher_t *pub, int32_t slot,
+                      void *stream, int64_t pts_ns)
+{
+    /* 1. wait for this stream's GPU writes */
+    cuStreamSynchronize((CUstream)stream);
+
+    /* 2. reset the ack bitmap for the new seq */
+    atomic_store_explicit(&pub->hdr->slots[slot].ack_bitmap, 0,
+                          memory_order_release);
+    atomic_store_explicit(&pub->hdr->slots[slot].pts_ns, pts_ns,
+                          memory_order_release);
+
+    /* 3. publish slot.seq; from here on the consumer can see it */
+    atomic_store_explicit(&pub->hdr->slots[slot].seq, pub->next_seq,
+                          memory_order_release);
+
+    /* 4. publish global_seq: the wake-up for polling consumers */
+    atomic_store_explicit(&pub->hdr->global_seq, pub->next_seq,
+                          memory_order_release);
+    return 0;
+}
+```
+
+The consumer on the other side (`consumer.c`) does an acquire-load, reads the slot metadata, and **re-checks seq** after it has resolved the slot pointer. This guards against the producer overwriting the slot between the read of `global_seq` and the actual copy:
+
+```c
+uint64_t gs = atomic_load_explicit(&sub->hdr->global_seq, memory_order_acquire);
+// ... find slot_idx from gs ...
+uint64_t slot_seq = atomic_load_explicit(&sub->hdr->slots[slot_idx].seq,
+                                          memory_order_acquire);
+int64_t pts = atomic_load_explicit(&sub->hdr->slots[slot_idx].pts_ns,
+                                    memory_order_acquire);
+
+/* v0.4: the producer already ran cuStreamSynchronize before atomic_store seq.
+ * The verify_seq post-check guards against the producer overwriting the slot. */
+uint64_t verify_seq = atomic_load_explicit(&sub->hdr->slots[slot_idx].seq,
+                                            memory_order_acquire);
+```
+
+After the post-check the consumer enqueues a DtoD memcpy on its own stream. On a single GPU **hardware coherence is guaranteed**: there is one HBM and one shared L2 cache, so once `cuStreamSynchronize` has returned on the producer side all its writes are in L2/HBM, and any subsequent kernel or copy on the same GPU sees them.
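+
+The publish / verify ordering can be exercised on the CPU alone. The following is a toy model, not the library's code: the struct layout, the `seq % RING` slot mapping, and the names `model_publish` / `model_read` are assumptions made for illustration, and there is no GPU payload. Sequence numbers start at 1 so that 0 can mean "empty".
+
+```c
+#include <stdatomic.h>
+#include <stdint.h>
+
+#define RING 4
+
+struct slot   { _Atomic uint64_t seq; _Atomic int64_t pts_ns; };
+struct header { _Atomic uint64_t global_seq; struct slot slots[RING]; };
+
+/* Producer: metadata first, then the per-slot seq, then global_seq. */
+static void model_publish(struct header *h, uint64_t seq, int64_t pts_ns)
+{
+    struct slot *s = &h->slots[seq % RING];
+    atomic_store_explicit(&s->pts_ns, pts_ns, memory_order_release);
+    atomic_store_explicit(&s->seq, seq, memory_order_release);
+    atomic_store_explicit(&h->global_seq, seq, memory_order_release);
+}
+
+/* Consumer: returns 1 and fills *pts_ns if the slot held `want` both
+ * before and after the metadata read, 0 if it was empty or overwritten. */
+static int model_read(struct header *h, uint64_t want, int64_t *pts_ns)
+{
+    struct slot *s = &h->slots[want % RING];
+    if (atomic_load_explicit(&s->seq, memory_order_acquire) != want)
+        return 0;
+    int64_t pts = atomic_load_explicit(&s->pts_ns, memory_order_acquire);
+    /* post-check: the producer may have reused the slot in the meantime */
+    if (atomic_load_explicit(&s->seq, memory_order_acquire) != want)
+        return 0;
+    *pts_ns = pts;
+    return 1;
+}
+```
+
+With `RING = 4`, publishing seq 5 reuses the slot that held seq 1, so a reader still asking for seq 1 gets 0 back instead of stale or mixed metadata.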
+
+## Sequence diagram
+
+```mermaid
+sequenceDiagram
+    participant PApp as Publisher app
+    participant PStream as Publisher CUDA stream
+    participant HBM as GPU HBM (shared)
+    participant Header as SHM header (atomic)
+    participant CApp as Consumer app
+    participant CStream as Consumer CUDA stream
+
+    PApp->>PStream: kernel / NVDEC writes
+    Note over PStream,HBM: async — pending in stream
+    PApp->>PStream: cuStreamSynchronize
+    PStream-->>HBM: all writes flushed
+    PStream-->>PApp: return (CPU unblocked)
+    PApp->>Header: atomic_store(seq, release)
+
+    loop poll
+        CApp->>Header: atomic_load(global_seq, acquire)
+    end
+    CApp->>Header: read pts, slot_seq
+    CApp->>Header: atomic_load(slot_seq, acquire)  [verify]
+    CApp->>CStream: cuMemcpyDtoDAsync(slot → dst)
+    CStream->>HBM: read — sees flushed data
+    CStream-->>CApp: copy enqueued
+```
+
+## Trade-offs
+
+| | v0.3 (CUDA IPC events) | v0.4 (stream sync) |
+|---|---|---|
+| Cross-namespace (no shared PID) | **no** | yes |
+| CPU blocks on publish | no | yes, ~1 ms |
+| GPU command queue ordering | automatic | manual (acquire/release) |
+| Extra race that needs a post-check | no | yes, guarded by verify_seq |
+| CUDA feature it depends on | `cudaIpcGetEventHandle` | `cuStreamSynchronize` (always available) |
+| Works with s6-overlay / Frigate | broken | works |
+
+`cuStreamSynchronize` costs on the order of a millisecond (depending on how much work is pending on the stream). For a 25 fps publisher that means the publisher thread spends ≈ 2.5% of its time blocked: noticeable, but not critical for real-time CCTV. If that is too expensive for your scenario, `cuEventQuery` polling is a possible optimization, but v0.4 does not do it yet (sync is simpler, more obviously correct, and cheap enough).
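+
+The 2.5% figure is just sync time multiplied by frame rate. A one-line helper makes it easy to plug in your own measurements; the function name is made up for this example, and the 1 ms input is the rough estimate from the text, not a benchmark result.
+
+```c
+/* Fraction of wall-clock time a publishing thread spends blocked in
+ * cuStreamSynchronize, given the per-publish sync time and the frame rate. */
+static double publish_block_fraction(double sync_ms, double fps)
+{
+    return sync_ms * fps / 1000.0;
+}
+```
+
+`publish_block_fraction(1.0, 25.0)` gives 0.025, the 2.5% quoted above; at 60 fps the same 1 ms would cost 6%.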
+
+Dropping events is not "better or worse"; it changes where the library can be used. v0.3 did not work under s6/Frigate. v0.4 does, at the cost of ~1 ms of blocked CPU time per publish.
+
+## What this means for developers
+
+- Pass `cuframes_publisher_publish` **the same stream** the data was written on. Otherwise `cuStreamSynchronize` waits on someone else's writes (in the worst case, none at all) and the data race is back.
+- `stream = NULL` (the default stream) is allowed, but the default stream serializes with the rest of the GPU context, which is usually slower than a dedicated stream.
+- Consumers must not rely on CUDA event sync; it is gone. Stream sync on the producer plus atomic ordering on the consumer replace the whole old IPC event machinery.
+
+## Next
+
+- [Frame ring vs Packet ring](/docs/concepts/frame-vs-packet-ring): packet ring sync is simpler, there is no CUDA involved at all.
+- [Ownership modes](/docs/concepts/ownership-modes): why the VMM limitation removed both EXTERNAL and events at once.
+- [Protocol reference](/docs/reference/protocol): exact layout of the shared header and its atomic fields.
@@ -0,0 +1,105 @@
+---
+title: FAQ
+sidebar_position: 99
+---
+
+# Frequently asked questions
+
+## Is cuframes production-ready?
+
+Honest answer: **early stage, but proven in one real deployment**. v0.4 has been running for 24+ hours in a CCTV setup (4 IP cameras → publisher → 2 subscribers: NVR recording + grid composer → TV) with no intervention. That is not "battle-tested at scale"; it is "one engineer's home setup".
+
+We recommend cuframes if:
+
+- You are building your own video pipeline and understand the risks of a v0.x OSS library.
+- Your team can read the source if something breaks (it is ≈ 2k lines of C).
+- You would rather pin a known commit than chase semver promises.
+
+We do not recommend cuframes if:
+
+- You need vendor support contracts.
+- You ship to customers who may file an incident to the maintainer's email at 3 a.m.
+- You cannot afford to write a workaround when a feature lands half-baked.
+
+## How does cuframes compare with DeepStream?
+
+|                       | cuframes                  | NVIDIA DeepStream         |
+| --------------------- | ------------------------- | ------------------------- |
+| Scope                 | Library (data plane only) | Full SDK + runtime        |
+| License               | LGPL-2.1+                 | Proprietary EULA          |
+| Footprint             | ~140 KB `.so`             | Multi-GB runtime          |
+| Lock-in               | None, the pipeline is yours | Pipeline = DeepStream plugins |
+| Cross-process sharing | Native (that is the whole point) | Within one process tree |
+| Support               | Best-effort GitHub        | Paid enterprise           |
+| Learning curve        | Hours                     | Weeks                     |
+
+cuframes does **not** try to replace DeepStream. It solves one specific problem: "I have one CUDA decoder and several processes that want decoded frames without re-decoding."
+
+## Why not GStreamer?
+
+GStreamer has `cudaupload` / `cudadownload` elements but no zero-copy cross-process model: every consumer runs its own pipeline. You can hack something together with `shmsink` / `shmsrc`, but you lose CUDA residency (frames bounce through CPU memory). That roundtrip is exactly what cuframes avoids.
+
+## Why not DMA-BUF + V4L2?
+
+That is the modern kernel-native path, and it works cross-vendor. We considered it. Why we went with CUDA VMM:
+
+- The target platform is NVIDIA-only (an existing CUDA decode pipeline).
+- Integrating DMA-BUF with CUDA requires `EGLStream` interop boilerplate, which is more code than the VMM + POSIX FD path.
+- Driver support varies with GPU age; CUDA VMM has been stable since CUDA 10.2.
+
+If your project is cross-vendor, DMA-BUF is the right choice and cuframes is not for you.
+
+## Can I use it on Windows?
+
+No. The implementation relies on POSIX shared memory (`shm_open`), Unix sockets, and file descriptor passing via `SCM_RIGHTS`. A Windows port would need:
+
+- Windows shared memory primitives (`CreateFileMapping`).
+- A different handle-sharing mechanism (`DuplicateHandle` over a named pipe).
+- CUDA VMM `WIN32_HANDLE` instead of `POSIX_FILE_DESCRIPTOR`.
+
+Not on the roadmap. PRs are welcome.
+
+## Can the publisher and consumer run on different machines?
+
+No. POSIX file descriptors do not travel over the network. Cross-host video sharing needs a different transport: RTSP, SRT, NDI, or your own implementation over NIC RDMA. cuframes is strictly same-host.
+
+## What if the publisher crashes while a consumer is reading?
+
+The consumer's next `cuframes_subscriber_next()` returns `CUFRAMES_ERR_DISCONNECTED`. The consumer should:
+
+1. Call `cuframes_subscriber_destroy()`.
+2. Wait (for example, a 1–2 second back-off).
+3. Try `cuframes_subscriber_create()` again with the same key.
+
+The FFmpeg demuxer (`cuframes://`) does this automatically: it re-subscribes every 2 seconds and returns `EAGAIN` to the avformat layer instead of `EOF`. See `libavformat/cuframesdec.c` if you are porting this to another framework.
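+
+The destroy / wait / re-create loop has the same shape whatever is being reconnected. The sketch below keeps it independent of cuframes so it can run anywhere: `retry_connect` is a hypothetical helper, and `fake_connect` is a stand-in for "destroy the old handle and call `cuframes_subscriber_create()`" that succeeds on its third call. It uses a fixed delay, as the demuxer is described to do; a growing back-off would be an easy variation.
+
+```c
+#include <time.h>
+
+typedef int (*connect_fn)(void *ctx);   /* returns 0 on success */
+
+/* Calls try_connect up to max_attempts times, sleeping delay_ms between
+ * failures. Returns the 1-based attempt that succeeded, or -1. */
+static int retry_connect(connect_fn try_connect, void *ctx,
+                         int max_attempts, long delay_ms)
+{
+    for (int attempt = 1; attempt <= max_attempts; attempt++) {
+        if (try_connect(ctx) == 0)
+            return attempt;
+        if (attempt < max_attempts) {
+            struct timespec ts = {
+                .tv_sec  = delay_ms / 1000,
+                .tv_nsec = (delay_ms % 1000) * 1000000L,
+            };
+            nanosleep(&ts, NULL);
+        }
+    }
+    return -1;
+}
+
+/* Stand-in for the real reconnect: fails until the call counter reaches 3. */
+static int fake_connect(void *ctx)
+{
+    int *calls = ctx;
+    return ++*calls >= 3 ? 0 : -1;
+}
+```
+
+In a real consumer the callback would own the `cuframes_subscriber_t *` and the config, and `delay_ms` would be in the 1000 to 2000 range suggested above.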
+
+## Can several publishers share one key?
+
+No. Each key (`/run/cuframes/<key>.sock` + `/dev/shm/cuframes-<key>`) maps to exactly one publisher. On `create()` the publisher detects "already running" via the shm header plus a PID liveness check and fails with `CUFRAMES_ERR_ALREADY_EXISTS`.
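+
+When debugging a stuck key it helps to know exactly which two paths to look at. This helper only formats the paths named above; the function name and the rule that a key must be non-empty and contain no `/` are assumptions of this sketch, not documented library behavior.
+
+```c
+#include <stddef.h>
+#include <stdio.h>
+#include <string.h>
+
+/* Builds the socket and SHM paths for a key. Returns 0 on success,
+ * -1 if the key is unusable or a buffer is too small. */
+static int key_paths(const char *key, char *sock, size_t sock_len,
+                     char *shm, size_t shm_len)
+{
+    if (key == NULL || key[0] == '\0' || strchr(key, '/') != NULL)
+        return -1;
+    int a = snprintf(sock, sock_len, "/run/cuframes/%s.sock", key);
+    int b = snprintf(shm, shm_len, "/dev/shm/cuframes-%s", key);
+    if (a < 0 || (size_t)a >= sock_len || b < 0 || (size_t)b >= shm_len)
+        return -1;
+    return 0;
+}
+```
+
+For key `cam1` this yields `/run/cuframes/cam1.sock` and `/dev/shm/cuframes-cam1`.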
+
+For load balancing or HA you have to layer your own naming scheme on top (for example `cam1-primary` and `cam1-backup`, with consumer-side logic choosing which one to subscribe to).
+
+## How many subscribers can a publisher hold?
+
+`CUFRAMES_MAX_SUBSCRIBERS = 32` (limited by the bitmap). Raising it requires a protocol version bump, because the bitmap lives in the SHM header.
+
+In practice we run 2–3 subscribers per camera (NVR + AI inference + grid composer). 32 leaves plenty of headroom.
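+
+The 32-subscriber ceiling follows directly from one bit per subscriber in a 32-bit word. A minimal, non-atomic model of such a bitmap is shown below; the names are hypothetical and the real header presumably updates its bitmap atomically, which this sketch does not attempt.
+
+```c
+#include <stdint.h>
+
+#define MAX_SUBSCRIBERS 32
+
+/* Claims the lowest free bit. Returns its index, or -1 when all 32 are taken. */
+static int claim_subscriber_bit(uint32_t *bitmap)
+{
+    for (int i = 0; i < MAX_SUBSCRIBERS; i++) {
+        uint32_t bit = UINT32_C(1) << i;
+        if ((*bitmap & bit) == 0) {
+            *bitmap |= bit;
+            return i;
+        }
+    }
+    return -1;
+}
+
+/* Frees a previously claimed bit so a later subscriber can reuse it. */
+static void release_subscriber_bit(uint32_t *bitmap, int idx)
+{
+    *bitmap &= ~(UINT32_C(1) << idx);
+}
+```
+
+Going beyond 32 would mean widening the word, which changes the header layout; that is why the text calls it a protocol version change.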
+
+## Licensing questions
+
+LGPL-2.1+. cuframes can be used in closed-source commercial products provided that:
+
+- It is linked dynamically (the end user can replace the `.so`).
+- Any modifications to `libcuframes` itself are published under the LGPL.
+
+Static linking brings extra LGPL obligations (users must be able to relink your product against a modified `libcuframes`, typically by shipping object files), which is usually not what you want.
+
+If the LGPL is incompatible with your use case (for example, embedded with no way to replace the library), get in touch before forking.
+
+## Where do I report bugs or contribute?
+
+- Source repo: https://git.goldix.org/gx/cuframes
+- Issues: same repo, `/issues`
+- Contribution guide: `CONTRIBUTING.md` in the source tree
+
+This documentation site lives separately at https://git.goldix.org/gx/cuframes-docs; typo fixes and content PRs go through the same Gitea.
@@ -0,0 +1,7 @@
+{
+  "label": "Getting started",
+  "position": 2,
+  "collapsible": true,
+  "collapsed": false,
+  "link": null
+}
@@ -0,0 +1,110 @@
+---
+title: First publisher
+sidebar_position: 2
+---
+
+# First publisher
+
+A minimal publisher that exposes a CUDA-resident ring of four 1920×1080 NV12 frames and writes 10 frames into it. Each frame is filled with a single-byte pattern via `cudaMemsetAsync`, so that the subscriber can later verify the contents end to end.
+
+This is a simplified version of [`spike/smoke_v04/smoke_pub.c`](https://git.goldix.org/gx/cuframes/src/branch/main/spike/smoke_v04/smoke_pub.c) from the cuframes repository.
+
+## Source
+
+```c
+/* first_publisher.c — publish 10 NV12 1920x1080 frames, then exit. */
+#include <cuframes/cuframes.h>
+#include <cuda_runtime.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <time.h>
+
+int main(int argc, char **argv) {
+    const char *key = argc > 1 ? argv[1] : "mykey";
+
+    cuframes_publisher_config_t cfg = {0};
+    cfg.key         = key;
+    cfg.width       = 1920;
+    cfg.height      = 1080;
+    cfg.format      = CUFRAMES_FORMAT_NV12;
+    cfg.ownership   = CUFRAMES_OWNERSHIP_LIBRARY;
+    cfg.ring_size   = 4;
+    cfg.policy      = CUFRAMES_POLICY_DROP_OLDEST;
+    cfg.cuda_device = 0;
+
+    cuframes_publisher_t *pub = NULL;
+    int r = cuframes_publisher_create(&cfg, &pub);
+    if (r != CUFRAMES_OK) {
+        fprintf(stderr, "create: %s\n", cuframes_strerror(r));
+        return 1;
+    }
+
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+
+    for (int i = 0; i < 10; i++) {
+        void *ptr = NULL;
+        if ((r = cuframes_publisher_acquire(pub, &ptr)) != CUFRAMES_OK) break;
+
+        /* NV12 = Y plane + interleaved UV plane = width*height*3/2 bytes */
+        cudaMemsetAsync(ptr, (uint8_t)i, 1920 * 1080 * 3 / 2, stream);
+
+        r = cuframes_publisher_publish(pub, stream, cuframes_now_ns());
+        if (r != CUFRAMES_OK) break;
+
+        struct timespec ts = {.tv_nsec = 40000000};  /* 25 fps */
+        nanosleep(&ts, NULL);
+    }
+
+    cudaStreamDestroy(stream);
+    cuframes_publisher_destroy(pub);
+    return r == CUFRAMES_OK ? 0 : 1;
+}
+```
+
+## Walkthrough
+
+**`cuframes_publisher_config_t cfg = {0};`**: always zero-initialise. The struct has a `_reserved[4]` field that must stay zero for forward ABI compatibility.
+
+**`cfg.key = "mykey"`**: names the publisher uniquely within the host. It becomes a path component of the Unix socket (`/run/cuframes/mykey.sock`) and of the POSIX SHM segment (`/dev/shm/cuframes-mykey`). Two publishers cannot share a key; the second one gets `CUFRAMES_ERR_ALREADY_EXISTS`.
+
+**`cfg.format = CUFRAMES_FORMAT_NV12`** plus `width`/`height`: the frame geometry is fixed for the lifetime of the publisher. Subscribers see exactly these dimensions.
+
+**`cfg.ownership = CUFRAMES_OWNERSHIP_LIBRARY`**: the library allocates the CUDA ring buffer itself. The alternative, `CUFRAMES_OWNERSHIP_EXTERNAL`, was meant for passing in already allocated device pointers (typically from an FFmpeg `AVHWFramesContext` pool); it is deprecated in v0.4 and returns `CUFRAMES_ERR_INVALID_ARG`. Details: [Concepts → Ownership modes](/docs/concepts/ownership-modes).
+
+**`cfg.ring_size = 4`**: the number of frame slots. 2 is the minimum, 4 is a sensible default, 16 is the ceiling. With the `DROP_OLDEST` policy a slow consumer simply skips frames; the publisher never blocks on consumers.
+
+**`cuframes_publisher_acquire(pub, &ptr)`**: returns a CUDA device pointer to the next writable slot. It is valid only until the matching `publish()` call.
+
+**`cudaMemsetAsync(ptr, ..., stream)`**: fill the frame on a CUDA stream of your choice. You do **not** need to synchronize that stream yourself before publish. In v0.4, `publish()` calls `cuStreamSynchronize` on the stream you pass in and only then makes the new seq visible, so subscribers never see a half-written slot. That is the cross-process synchronization contract; see [Concepts → Cross-process sync](/docs/concepts/cross-process-sync).
+
+**`cuframes_publisher_publish(pub, stream, pts_ns)`**: makes the slot visible to subscribers. `pts_ns` is opaque to the library; the recommended source is `cuframes_now_ns()` (CLOCK_MONOTONIC in nanoseconds).
+
+**Cleanup**: `cuframes_publisher_destroy()` closes the socket, unlinks the SHM segment, and frees the CUDA pool.
+
+## Compiling
+
+```bash
+gcc -O2 -I/usr/local/include -I/usr/local/cuda/include \
+    -o first_publisher first_publisher.c \
+    -L/usr/local/lib -lcuframes \
+    -L/usr/local/cuda/lib64 -lcudart -lcuda
+```
+
+If cuframes was built without `cmake --install`, point `-I` and `-L` at your `build/` tree (`-I./include -L./build/libcuframes`).
+
+## Running
+
+```bash
+./first_publisher mykey
+```
+
+While the process is running it owns:
+
+- `/run/cuframes/mykey.sock`: handshake / control socket
+- `/dev/shm/cuframes-mykey`: shared metadata header (SHM)
+
+Both are removed on a clean shutdown. If the publisher crashes, stale files may be left behind; the next start recreates them.
+
+## Next
+
+Open a second terminal and attach the [First subscriber](./first-subscriber.md), which reads these frames and checks the pattern. The full API description is in [Reference → C API](/docs/reference/api-c).
@@ -0,0 +1,145 @@
+---
+title: First subscriber
+sidebar_position: 3
+---
+
+# First subscriber
+
+A minimal subscriber that connects to the publisher from [First publisher](./first-publisher.md), reads 10 frames, and checks that the first 1 KiB of each frame is filled with a single byte value, as written by the publisher.
+
+This is a simplified version of [`spike/smoke_v04/smoke_sub.c`](https://git.goldix.org/gx/cuframes/src/branch/main/spike/smoke_v04/smoke_sub.c).
+
+## Source
+
+```c
+/* first_subscriber.c — connect, read 10 frames, verify pattern. */
+#include <cuframes/cuframes.h>
+#include <cuda_runtime.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+int main(int argc, char **argv) {
+    const char *key = argc > 1 ? argv[1] : "mykey";
+
+    cuframes_subscriber_config_t cfg = {0};
+    cfg.key                = key;
+    cfg.consumer_name      = "first-sub";
+    cfg.mode               = CUFRAMES_MODE_NEWEST_ONLY;
+    cfg.cuda_device        = 0;
+    cfg.connect_timeout_ms = 5000;
+
+    cuframes_subscriber_t *sub = NULL;
+    int r = cuframes_subscriber_create(&cfg, &sub);
+    if (r != CUFRAMES_OK) {
+        fprintf(stderr, "create: %s\n", cuframes_strerror(r));
+        return 1;
+    }
+
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+
+    const size_t sample = 1024;             /* check first 1 KiB of each frame */
+    uint8_t *host = malloc(sample);
+
+    int frames = 0, good = 0;
+    while (frames < 10) {
+        cuframes_frame_t *f = NULL;
+        r = cuframes_subscriber_next(sub, stream, &f, 2000);
+        if (r != CUFRAMES_OK) {
+            fprintf(stderr, "next: %s\n", cuframes_strerror(r));
+            break;
+        }
+
+        cudaMemcpyAsync(host, cuframes_frame_cuda_ptr(f), sample,
+                        cudaMemcpyDeviceToHost, stream);
+        cudaStreamSynchronize(stream);
+
+        int mismatch = 0;
+        for (size_t i = 1; i < sample; i++)
+            if (host[i] != host[0]) mismatch++;
+        if (mismatch == 0) good++;
+
+        printf("seq=%lu pts_ns=%lld pitch_y=%d byte0=0x%02x mismatch=%d\n",
+               (unsigned long)cuframes_frame_seq(f),
+               (long long)cuframes_frame_pts_ns(f),
+               cuframes_frame_pitch_y(f),
+               host[0], mismatch);
+
+        cuframes_subscriber_release(sub, f);
+        frames++;
+    }
+
+    free(host);
+    cudaStreamDestroy(stream);
+    cuframes_subscriber_destroy(sub);
+    return (good == frames && frames > 0) ? 0 : 1;
+}
+```
+
+## Walkthrough
+
+**`cfg.key`**: must match the publisher's key exactly. The subscriber finds the publisher by `connect()`ing to `/run/cuframes/<key>.sock`.
+
+**`cfg.consumer_name = "first-sub"`**: identifies the subscriber in the publisher's ACK bitmap. It must be unique among the live subscribers of one publisher; a collision returns `CUFRAMES_ERR_ALREADY_EXISTS`. If you pass `NULL`, the library generates `subscriber-<pid>-<random>`. A publisher accepts up to 32 concurrent subscribers.
+
+**`cfg.mode = CUFRAMES_MODE_NEWEST_ONLY`**: the subscriber always jumps to the most recently published frame and skips every frame the publisher produced while the previous `next()` result was being processed. Use `CUFRAMES_MODE_STRICT_ORDER` if you must see every frame in seq order; in that mode a ring overflow surfaces as `CUFRAMES_ERR_DISCONNECTED`.
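+
+The difference between the two modes comes down to which seq the subscriber asks for next. The function below is an illustrative model only: the enum, the name `next_seq_to_read`, and the exact overflow condition are assumptions of this sketch and are not taken from `consumer.c`. It assumes a ring of `ring_size` slots always holds the last `ring_size` published seqs.
+
+```c
+#include <stdint.h>
+
+enum read_mode { MODE_NEWEST_ONLY, MODE_STRICT_ORDER };
+
+/* Returns the seq to read next, 0 if nothing new has been published,
+ * or UINT64_MAX if strict mode wants a frame that was already overwritten. */
+static uint64_t next_seq_to_read(enum read_mode mode, uint64_t last_seen,
+                                 uint64_t global_seq, uint32_t ring_size)
+{
+    if (global_seq <= last_seen)
+        return 0;                       /* nothing new yet */
+    if (mode == MODE_NEWEST_ONLY)
+        return global_seq;              /* jump straight to the latest frame */
+
+    uint64_t want = last_seen + 1;      /* strict: the very next seq */
+    if (global_seq - want >= ring_size)
+        return UINT64_MAX;              /* ring overflow: that slot was reused */
+    return want;
+}
+```
+
+With a 4-slot ring, a strict subscriber that last saw seq 3 can still get seq 4 while the publisher is at seq 7, but not once it reaches seq 8.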
+
+**`cfg.connect_timeout_ms = 5000`**: how long `create()` waits for the publisher to appear. `0` fails immediately with `CUFRAMES_ERR_NOT_FOUND`; `-1` waits forever.
+
+**`cfg.cuda_device`**: must match the publisher's `cuda_device`. The shared memory handles are not portable across devices.
+
+**`cuframes_subscriber_next(sub, stream, &f, 2000)`**: blocks for up to 2 seconds waiting for the next frame. In v0.4 the publisher has already run `cuStreamSynchronize` before publishing the seq, so by the time `next()` returns the producer's writes are complete, and any kernel or copy you enqueue on `stream` afterwards sees them. If you read back with `cudaMemcpyDeviceToHost`, enqueue it on the same `stream` and synchronize that stream before touching the host buffer.
+
+**Frame accessors**: `cuframes_frame_cuda_ptr()` (device pointer, read-only), `cuframes_frame_format()`, `cuframes_frame_pitch_y()` / `_pitch_uv()`, `cuframes_frame_seq()` (monotonic per publisher), `cuframes_frame_pts_ns()` (CLOCK_MONOTONIC on the publisher side). After `release()` the handle is invalid; do not call accessors on it.
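+
+The pitch accessors matter as soon as you address individual pixels, because a row in GPU memory is usually wider than `width` bytes. The helpers below compute byte offsets from the frame's base pointer for NV12. They assume the UV plane starts immediately after `height` rows of Y, which is a common layout but is not confirmed by this page; check the protocol reference before relying on it. Both function names are made up for the example.
+
+```c
+#include <stddef.h>
+#include <stdint.h>
+
+/* Byte offset of the luma sample at pixel (x, y). */
+static size_t nv12_y_offset(uint32_t x, uint32_t y, uint32_t pitch_y)
+{
+    return (size_t)y * pitch_y + x;
+}
+
+/* Byte offset of the U byte of the chroma pair covering pixel (x, y);
+ * the matching V byte is at the returned offset + 1.
+ * Assumes the UV plane directly follows `height` rows of Y. */
+static size_t nv12_uv_offset(uint32_t x, uint32_t y, uint32_t height,
+                             uint32_t pitch_y, uint32_t pitch_uv)
+{
+    return (size_t)height * pitch_y
+         + (size_t)(y / 2) * pitch_uv
+         + (size_t)(x / 2) * 2;
+}
+```
+
+On the device side these offsets would be added to `cuframes_frame_cuda_ptr(f)`, with `pitch_y` and `pitch_uv` taken from the corresponding accessors.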
+
+**`cuframes_subscriber_release(sub, f)`**: sends the ACK for the slot back to the publisher. The publisher only needs it under `CUFRAMES_POLICY_STRICT_WAIT`; with the default `DROP_OLDEST` you still have to call it to free the handle on the consumer side. Passing NULL is a no-op.
+
+## Compiling
+
+```bash
+gcc -O2 -I/usr/local/include -I/usr/local/cuda/include \
+    -o first_subscriber first_subscriber.c \
+    -L/usr/local/lib -lcuframes \
+    -L/usr/local/cuda/lib64 -lcudart -lcuda
+```
+
+## Running
+
+In one terminal, start the publisher from the previous page:
+
+```bash
+./first_publisher mykey
+```
+
+In another terminal:
+
+```bash
+./first_subscriber mykey
+```
+
+Expected output: 10 lines with increasing `seq` and `mismatch=0`.
+
+## A note on Docker
+
+The subscriber must share the publisher's **IPC namespace** (so that `shm_open` can open the same `/dev/shm/cuframes-mykey` header). Sharing the PID namespace is **not required**; this changed in v0.4. The old v0.1–v0.3 requirement of `--pid=container:<publisher>` is gone, because handles now travel as POSIX file descriptors over the Unix socket (`SCM_RIGHTS`) rather than as CUDA IPC mem handles.
+
+```bash
+docker run --rm --runtime=nvidia \
+  --ipc=container:cuframes-pub \
+  -v /run/cuframes:/run/cuframes:ro \
+  gx/cuframes:0.4 ./first_subscriber mykey
+```
+
+## Handling disconnect
+
+If the publisher exits or crashes during your loop, the next `cuframes_subscriber_next()` returns `CUFRAMES_ERR_DISCONNECTED`. The handle is dead after that: destroy it and (optionally) reconnect:
+
+```c
+if (r == CUFRAMES_ERR_DISCONNECTED) {
+    cuframes_subscriber_destroy(sub);
+    sub = NULL;
+    /* sleep + retry cuframes_subscriber_create(&cfg, &sub) */
+}
+```
+
+The reconnect pattern, including back-off and the subtleties of reusing `consumer_name`, is covered in [Concepts → Reconnect](/docs/concepts/reconnect).
@@ -0,0 +1,110 @@
+---
+title: Installation
+sidebar_position: 1
+---
+
+# Installation
+
+cuframes runs **on Linux only**. Its IPC mechanism relies on POSIX shared memory and on passing file descriptors via `SCM_RIGHTS` over Unix sockets. Windows, macOS, and WSL2 are not supported.
+
+You also need an NVIDIA GPU with compute capability ≥ 7.5 (Turing or newer) and a CUDA 12+ driver. The full matrix is in [Concepts → Requirements](/docs/concepts/requirements).
+
+## Option 1: prebuilt Docker image (recommended for a first look)
+
+The runtime image ships `libcuframes.so` and the `cuframes-rtsp-source` bridge utility on top of `nvidia/cuda:12.4.1-runtime`.
+
+```bash
+docker pull gx/cuframes:0.4
+```
+
+Smoke check:
+
+```bash
+docker run --rm --runtime=nvidia gx/cuframes:0.4 \
+  /usr/local/bin/cuframes-rtsp-source --help
+```
+
+To run a publisher and a subscriber in two containers, the **publisher** container must start with `--ipc=shareable`, and the **subscriber** must join its IPC namespace via `--ipc=container:<publisher>`. Sharing the PID namespace is **not required** as of v0.4, because handles are passed as POSIX file descriptors over the Unix socket.
+
+```bash
+# Publisher
+docker run -d --name cuframes-pub --runtime=nvidia --ipc=shareable \
+  -v /run/cuframes:/run/cuframes \
+  gx/cuframes:0.4 \
+  /usr/local/bin/cuframes-rtsp-source --rtsp 'rtsp://...' --key cam1
+
+# Subscriber
+docker run --rm --runtime=nvidia \
+  --ipc=container:cuframes-pub \
+  -v /run/cuframes:/run/cuframes:ro \
+  gx/cuframes:0.4 \
+  /usr/local/bin/sub_count --key cam1 --max-frames 100
+```
+
+More on the namespace rules: [Concepts → Docker IPC](/docs/concepts/docker-ipc).
+
+## Option 2: build from source
+
+### Build requirements
+
+| | Minimum |
+|---|---|
+| CUDA Toolkit | 12.0 |
+| NVIDIA driver | 525 |
+| CMake | 3.20 |
+| GCC / Clang | 11 / 14 |
+| FFmpeg dev libs | libavcodec, libavformat, libavutil (only for `cuframes-rtsp-source`) |
+
+On Ubuntu 22.04 / 24.04:
+
+```bash
+sudo apt-get install -y \
+  build-essential cmake ninja-build pkg-config \
+  libavcodec-dev libavformat-dev libavutil-dev
+```
+
+### Configure and build
+
+```bash
+git clone https://git.goldix.org/gx/cuframes.git
+cd cuframes
+cmake -B build -S . -G Ninja -DCMAKE_BUILD_TYPE=Release
+cmake --build build --parallel
+```
+
+Build outputs:
+
+- `build/libcuframes/libcuframes.so`: shared library
+- `build/tools/cuframes-rtsp-source/cuframes-rtsp-source`: RTSP bridge
+- `build/examples/sub_count/sub_count`: reference subscriber
+
+### System-wide install
+
+```bash
+sudo cmake --install build --prefix /usr/local
+sudo ldconfig
+```
+
+Headers go to `/usr/local/include/cuframes/`, the library to `/usr/local/lib/`.
+
+### Build options
+
+| Option | Default | Notes |
+|---|---|---|
+| `BUILD_TOOLS` | `ON` | `cuframes-rtsp-source` (needs the FFmpeg dev libs) |
+| `BUILD_EXAMPLES` | `ON` | reference subscriber `sub_count` |
+| `BUILD_TESTING` | `ON` | unit and stress tests |
+| `BUILD_FFMPEG_FILTER` | `OFF` | out-of-tree, requires a patched FFmpeg tree |
+| `BUILD_PYTHON_BINDINGS` | `OFF` | planned |
+
+## Option 3: apt / dpkg packages
+
+Planned for the v1.0 release. Until then, use the Docker image or build from source.
+
+## Verify the installation
+
+```bash
+cuframes-rtsp-source --help
+```
+
+If the binary is on your `PATH` and prints its usage banner, the runtime is in place. To confirm that the library itself loads from your own code, move on to [First publisher](./first-publisher.md).
@@ -0,0 +1,7 @@
+{
+  "label": "Integration",
+  "position": 4,
+  "collapsible": true,
+  "collapsed": false,
+  "link": null
+}
@@ -0,0 +1,125 @@
+---
+title: FFmpeg cuframes:// demuxer
+sidebar_position: 1
+---
+
+# FFmpeg `cuframes://` demuxer
+
+cuframes ships two FFmpeg input demuxers, both as a patch on top of upstream FFmpeg:
+
+- **`cuframes`** subscribes to the decoded NV12 frame ring and exposes it as a `rawvideo` stream (one stream per URL).
+- **`cuframes_packets`** subscribes to the encoded packet ring and exposes it as an `h264` / `hevc` byte stream, with `extradata` taken from the publisher's handshake.
+
+Both demuxers are pure consumers. They never decode, never re-encode, and never touch the network: the actual RTSP pull happens once, in the publisher (usually [`cuframes-rtsp-source`](/docs/getting-started/install)). FFmpeg simply attaches to the existing ring through the Unix socket.
+
+## URL-схема
+
+```text
+cuframes://<key>            # decoded NV12 frames (raw GPU surfaces)
+cuframes_packets://<key>    # encoded H264/HEVC packets (Annex-B)
+```
+
+`<key>` is the publisher's key, the same string that is passed to `cuframes_publisher_create()` or to `--key` of `cuframes-rtsp-source`. The legacy form `cuframes:<key>` (without `//`) is also accepted.
+
+The two ring types are independent. A single publisher can expose both: decoded frames for composers / AI, and encoded packets for recorders that want to skip the re-encode.
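
The URL forms above map to a ring type, a key, and a socket path. The sketch below illustrates that mapping; it is not the demuxer's actual parser. The names `CuframesUrl` and `parse_cuframes_url` are hypothetical. Rejecting keys that contain `/` and accepting the legacy form only for the frame ring are assumptions. The socket directory comes from the handshake description on this page.

```python
from typing import NamedTuple

SOCKET_DIR = "/run/cuframes"  # socket directory named in these docs


class CuframesUrl(NamedTuple):
    ring: str         # "frames" or "packets"
    key: str
    socket_path: str


def parse_cuframes_url(url: str) -> CuframesUrl:
    # Longest prefixes first, so "cuframes://" is not mistaken for the legacy form.
    prefixes = (
        ("cuframes_packets://", "packets"),
        ("cuframes://", "frames"),
        ("cuframes:", "frames"),  # legacy form without "//"
    )
    for prefix, ring in prefixes:
        if url.startswith(prefix):
            key = url[len(prefix):]
            # Assumption: a key is a single path component.
            if not key or "/" in key:
                raise ValueError(f"invalid cuframes key in {url!r}")
            return CuframesUrl(ring, key, f"{SOCKET_DIR}/{key}.sock")
    raise ValueError(f"not a cuframes URL: {url!r}")
```

For example, `parse_cuframes_url("cuframes://cam-parking").socket_path` gives the path a subscriber would connect to for that key.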
+
+## What the demuxer does
+
+On open, the demuxer:
+
+1. Connects to `/run/cuframes/<key>.sock`.
+2. Receives N POSIX file descriptors via `SCM_RIGHTS` (the frame slots) plus the shm metadata header.
+3. For `cuframes`: imports each FD as a CUDA VMM allocation and declares a single `AV_PIX_FMT_NV12` stream with the publisher's width / height / framerate.
+4. For `cuframes_packets`: reads `extradata` (SPS/PPS for H264, VPS/SPS/PPS for HEVC) from the handshake and declares a single `AV_CODEC_ID_H264` or `AV_CODEC_ID_HEVC` stream.
+
+In the read loop the demuxer polls the publisher's `global_seq`, copies the frame / packet into the pipeline, and stamps `pts` from the publisher's clock. The wire format is described in the [Protocol reference](/docs/reference/protocol).
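
The descriptor-passing step can be tried in isolation. The sketch below shows only the `SCM_RIGHTS` mechanism over a Unix socket, using Python's `socket.send_fds` / `socket.recv_fds` (Python 3.9+, Unix only). It does not reproduce the cuframes wire format: the one-byte-per-descriptor payload and the helper names are assumptions, and a pipe stands in for an exported VMM slot.

```python
import os
import socket


def send_slot_fds(sock: socket.socket, fds: list[int]) -> None:
    # SCM_RIGHTS ancillary data must accompany at least one byte of ordinary
    # data, so a dummy payload of one zero byte per descriptor is sent along.
    socket.send_fds(sock, [bytes(len(fds))], fds)


def recv_slot_fds(sock: socket.socket, max_fds: int) -> list[int]:
    # recv_fds returns (data, fds, msg_flags, address); the descriptors are
    # new numbers in the receiving process that refer to the same open files.
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1024, max_fds)
    return list(fds)


if __name__ == "__main__":
    publisher, subscriber = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()              # stand-in for an exported slot descriptor
    os.write(w, b"slot-0")
    send_slot_fds(publisher, [r])
    (received,) = recv_slot_fds(subscriber, max_fds=8)
    print(os.read(received, 16))  # data written through the original pipe
```

In the real handshake the received descriptors are then imported as CUDA VMM allocations rather than read like files.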
+
+## Example pipelines
+
+### Single source: decoded ring → NVENC → MPEG-TS
+
+Re-encodes a published camera into an H.264 MPEG-TS UDP stream. NVDEC is not used in this ffmpeg process: the publisher has already decoded once, and this process only runs NVENC on the shared NV12 surface.
+
+```bash
+ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
+  -f cuframes -i cuframes://cam-parking \
+  -c:v h264_nvenc -preset p4 -b:v 4M \
+  -f mpegts udp://192.168.88.50:5000
+```
+
+### Packet ring: true copy, no decode and no encode
+
+When all you need is to record or restream an existing camera and the decoded pixels are of no interest, subscribe to the **packet** ring and use `-c:v copy`. Both NVDEC and NVENC stay idle.
+
+```bash
+ffmpeg -f cuframes_packets -i cuframes_packets://cam-parking \
+  -c:v copy -f mp4 /var/recordings/cam-parking.mp4
+```
+
+This is the cheapest way to fan one publisher out to N recorders.
+
+### Composition: 4 inputs into a CUDA grid
+
+A short example of multi-input wiring. For the full filter reference see [FFmpeg `vf_cuda_grid` filter](/docs/integration/ffmpeg-filter).
+
+```bash
+ffmpeg \
+  -hwaccel cuda -hwaccel_output_format cuda \
+  -f cuframes -i cuframes://cam1 \
+  -f cuframes -i cuframes://cam2 \
+  -f cuframes -i cuframes://cam3 \
+  -f cuframes -i cuframes://cam4 \
+  -filter_complex "[0:v][1:v][2:v][3:v]cuda_grid=layout=quad[out]" \
+  -map "[out]" -c:v h264_nvenc -preset p4 \
+  -f rtsp rtsp://127.0.0.1:8554/grid
+```
+
+## Building / running with the patched FFmpeg
+
+The demuxers live in `libavformat/cuframesdec.c` and `libavformat/cuframes_packetsdec.c` and are not part of upstream FFmpeg. There are two options.
+
+### Option A: prebuilt Docker image
+
+A production-tested image is published as `ffmpeg-vf-cuda-grid:phase8`. It contains the patched ffmpeg binary, `libcuframes.so`, and the `vf_cuda_grid` filter. A reference `docker compose` setup that wires it to a publisher container lives in the `localhost-infra` repo; copy and adapt it rather than pinning it as a public dependency.
+
+```bash
+docker run --rm --runtime=nvidia \
+  --ipc=container:cuframes-publisher \
+  ffmpeg-vf-cuda-grid:phase8 \
+  ffmpeg -f cuframes -i cuframes://cam1 -c:v copy -f null -
+```
+
+The `--ipc=container:...` flag joins the publisher's IPC namespace so that the POSIX shm header is visible. Sharing the PID namespace is **not required** as of cuframes v0.4.
+
+### Option B: build it yourself
+
+Use the `ffmpeg-builds` toolchain (a fork of `BtbN/FFmpeg-Builds`). The script `scripts.d/50-libcuframes.sh` clones cuframes and builds it statically, and `--enable-libcuframes` is added automatically when the `cuframes` addin is active.
+
+```bash
+git clone <ffmpeg-builds repo> ffmpeg-builds
+cd ffmpeg-builds
+ADDITIONAL_SCRIPTS=50-libcuframes.sh ./build.sh <target> <variant>
+```
+
+The patched ffmpeg source tree (with both demuxers and the filter) lives in `ffmpeg-fresh/`. To vendor the patch into your own FFmpeg fork, copy the three files (`libavformat/cuframesdec.c`, `libavformat/cuframes_packetsdec.c`, `libavfilter/vf_cuda_grid.c`) together with the corresponding `Makefile` entries and the registrations in `allformats.c` / `allfilters.c`.
+
+## Reconnect behaviour
+
+Publishers come and go: container restarts, RTSP camera reboots, upgrades of `cuframes-rtsp-source` on the host. The demuxer is designed to **survive a publisher restart without tearing down the FFmpeg pipeline**.
+
+When the subscriber sees `CUFRAMES_ERR_DISCONNECTED`:
+
+- The demuxer does **not** return `AVERROR_EOF`.
+- It frees the dead subscriber and retries `cuframes_subscriber_create()`, rate-limited to **one attempt every 2 seconds**.
+- While the reconnect is in progress, `av_read_frame()` returns `AVERROR(EAGAIN)`. The pipeline stalls but stays alive.
+- On success the demuxer logs `cuframes: reconnected to '<key>'` at `INFO` level and resumes frame delivery.
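
The rate-limited retry in the list above can be sketched as a small state machine. This is an illustration of the pattern, not the demuxer's C code. The class `RateLimitedReconnect` and its `connect` callback are hypothetical, and returning `None` stands in for `AVERROR(EAGAIN)`. The clock is injectable so the logic can be exercised without sleeping.

```python
import time
from typing import Callable, Optional


class RateLimitedReconnect:
    """Allow at most one connect attempt per `interval` seconds."""

    def __init__(self, connect: Callable[[], Optional[object]],
                 interval: float = 2.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._connect = connect
        self._interval = interval
        self._clock = clock
        self._last_attempt: Optional[float] = None

    def poll(self) -> Optional[object]:
        """Return a new connection, or None meaning "try again later"."""
        now = self._clock()
        if self._last_attempt is not None and now - self._last_attempt < self._interval:
            return None  # too soon: the analogue of returning EAGAIN
        self._last_attempt = now
        # connect() itself returns None when the publisher is still down.
        return self._connect()
```

A caller invokes `poll()` from its read loop as often as it likes; only one call per interval actually reaches `connect`.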
+
+This matters for long-running consumers (NVR recorders, RTSP restreamers, NVENC composers) that would otherwise need an external supervisor to restart ffmpeg on every publisher hiccup.
+
+If you really need EOF on disconnect, for example a one-shot transcode that must stop when the source dies, wrap the demuxer with `-timeout` or your own watchdog. The built-in behaviour is "wait forever", not "fail fast".
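
One possible shape for such a watchdog is a wrapper that terminates the child process when its output stops making progress. This is a rough sketch under assumptions. The function `run_with_stall_watchdog` and the `progress` callback are hypothetical. What counts as progress (output file size, a frame counter, and so on) is up to the caller.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_stall_watchdog(cmd: Sequence[str], progress: Callable[[], int],
                            stall_timeout: float, poll_interval: float = 0.5) -> int:
    """Run cmd; terminate it if progress() stops changing for stall_timeout seconds.

    Returns the child's exit code (negative on POSIX if it died from a signal).
    """
    proc = subprocess.Popen(cmd)
    last_value = progress()
    last_change = time.monotonic()
    try:
        while proc.poll() is None:
            time.sleep(poll_interval)
            value = progress()
            now = time.monotonic()
            if value != last_value:
                last_value, last_change = value, now
            elif now - last_change >= stall_timeout:
                proc.terminate()      # SIGTERM first, so the child can clean up
                try:
                    proc.wait(timeout=5)
                except subprocess.TimeoutExpired:
                    proc.kill()       # escalate if it ignores SIGTERM
                break
    finally:
        if proc.poll() is None:
            proc.kill()
    return proc.wait()
```

For a recorder, `progress` could be something like `lambda: os.path.getsize(path) if os.path.exists(path) else 0`, so a publisher that stays down longer than `stall_timeout` ends the ffmpeg process instead of leaving it waiting.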
+
+## See also
+
+- [First publisher](/docs/getting-started/first-publisher): a minimal C producer.
+- [Protocol reference](/docs/reference/protocol): wire format and handshake.
+- [`vf_cuda_grid` filter](/docs/integration/ffmpeg-filter): multi-camera composition.
@@ -0,0 +1,103 @@
+---
+title: FFmpeg vf_cuda_grid filter (CCTV grid composition)
+sidebar_position: 2
+---
+
+# FFmpeg `vf_cuda_grid` filter
+
+`vf_cuda_grid` is a CUDA-accelerated multi-input grid composer for CCTV video walls: N cameras in, one composited frame out, all on the GPU.
+
+:::note Not part of cuframes
+`vf_cuda_grid` is a **separate out-of-tree FFmpeg filter** maintained as its own project ([git.goldix.org/gx/vf-cuda-grid](https://git.goldix.org/gx/vf-cuda-grid)). It happens to ship in the same patched FFmpeg image as the cuframes demuxers, and it is the canonical consumer that shows what cuframes is *for*. It is not part of the cuframes library, however, and follows its own release cycle.
+
+This page is a short pointer so that the integration story reads end to end. The full option list, layout templates, ZMQ command protocol, and overlay rendering details are in the upstream project.
+:::
+
+## Why it exists
+
+A typical CCTV TV-wall pipeline needs to:
+
+- Decode 4–16 cameras once (cuframes publishers do this via NVDEC).
+- Compose them into a single grid layout (single / quad / main + previews / …).
+- Re-encode the resulting frame with NVENC and send it to the displays over RTSP / SRT.
+
+Doing this with stock FFmpeg filters means CPU round trips (`hwdownload` / `hwupload`) that kill the frame rate at 4K. `vf_cuda_grid` stays in CUDA memory end to end: input surfaces come straight from the cuframes ring, are placed into a single output CUDA frame via `cuMemcpy2DAsync`, and are handed to `h264_nvenc` without leaving the GPU.
+
+## Quick example: 4 cameras in a 2×2 quad
+
+```bash
+ffmpeg \
+  -hwaccel cuda -hwaccel_output_format cuda \
+  -f cuframes -i cuframes://cam1 \
+  -f cuframes -i cuframes://cam2 \
+  -f cuframes -i cuframes://cam3 \
+  -f cuframes -i cuframes://cam4 \
+  -filter_complex "[0:v][1:v][2:v][3:v]cuda_grid=layout=quad[out]" \
+  -map "[out]" \
+  -c:v h264_nvenc -preset p4 -tune ll -b:v 8M \
+  -f rtsp rtsp://127.0.0.1:8554/wall
+```
+
+All four inputs are decoded NV12 surfaces shared from the publisher processes. The filter never touches CPU memory. The encoder receives a CUDA frame.
+
+## Layout templates
+
+The filter's current templates (defined in `libavfilter/vf_cuda_grid.c`):
+
+- `single`: one input, full canvas. Useful for runtime layout switching (start with quad, switch to one camera fullscreen).
+- `dual_h`, `dual_v`: two inputs, horizontal / vertical split.
+- `quad`: four inputs, 2×2.
+- `main_plus_preview`: one large cell plus a row of three smaller ones.
+
+Select one with `layout=<name>`. The output canvas size defaults to the resolution of the first input and can be changed through filter options.
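
To make the regular layouts concrete, the sketch below computes cell rectangles for a given canvas. It is not taken from `vf_cuda_grid.c`. The even-size rounding, the reading of `dual_h` as side by side and `dual_v` as stacked, and all names are assumptions. `main_plus_preview` is left out because its exact geometry is not specified on this page.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height in pixels


def grid_cells(canvas_w: int, canvas_h: int, cols: int, rows: int) -> List[Rect]:
    """Split the canvas into cols x rows cells in row-major order."""
    # NV12 subsamples chroma 2x2, so cell sizes (and thus origins) are kept even.
    cell_w = (canvas_w // cols) & ~1
    cell_h = (canvas_h // rows) & ~1
    return [(c * cell_w, r * cell_h, cell_w, cell_h)
            for r in range(rows) for c in range(cols)]


# (columns, rows) per template; the dual_* orientation is an assumption.
LAYOUTS = {
    "single": (1, 1),
    "dual_h": (2, 1),
    "dual_v": (1, 2),
    "quad": (2, 2),
}


def layout_cells(name: str, canvas_w: int, canvas_h: int) -> List[Rect]:
    cols, rows = LAYOUTS[name]
    return grid_cells(canvas_w, canvas_h, cols, rows)
```

For a 1920×1080 canvas, `layout_cells("quad", 1920, 1080)` yields four 960×540 cells, one per input.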
+
+## Runtime control over ZMQ
+
+The filter exposes two commands through FFmpeg's `process_command` mechanism (relayed by the `zmq` filter):
+
+- `set_layout <name>`: switch the active layout template without restarting the pipeline. Limited by `max_cells` (a compile-time maximum, so that input pads never have to be re-registered).
+- `add_overlay <args>`: draw a coloured rectangle / text / image overlay inside a specific cell (for example a red border on a camera with motion, or a timestamp on cell 0).
+
+The argument format and full overlay semantics are documented in the upstream project.
+
+:::tip Quoting ZMQ arguments
+When you pass ZMQ commands from a shell, the command and its args **must** be a single quoted argument; otherwise FFmpeg's ZMQ filter parses only the command and silently drops the rest. Enough people have tripped over this to make it worth mentioning here:
+
+```bash
+# correct
+echo "Parsed_cuda_grid_0 set_layout main_plus_preview" | zmqsend
+
+# wrong: args are silently dropped
+echo Parsed_cuda_grid_0 set_layout main_plus_preview | zmqsend
+```
+:::
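
A controller written in Python can avoid shell quoting entirely by building the message as one string. The sketch below assumes the third-party `pyzmq` package and FFmpeg's stock `zmq` filter listening on its default port 5555. The helper names and the `tcp://localhost:5555` endpoint are assumptions to adapt to your setup.

```python
def format_command(target: str, command: str, args: str = "") -> str:
    """Build the single-line "TARGET COMMAND [ARG]" message for FFmpeg's zmq filter."""
    for part in (target, command):
        if not part or any(ch.isspace() for ch in part):
            raise ValueError(f"target and command must be single words: {part!r}")
    return f"{target} {command} {args}".rstrip()


def send_command(message: str, endpoint: str = "tcp://localhost:5555",
                 timeout_ms: int = 2000) -> str:
    """Send one message over a REQ socket and return the raw reply text."""
    import zmq  # third-party pyzmq, imported lazily so formatting works without it

    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)            # do not block on close
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)  # raise zmq.Again if no reply arrives
    try:
        sock.connect(endpoint)
        sock.send_string(message)
        return sock.recv_string()
    finally:
        sock.close()
```

A layout switch is then `send_command(format_command("Parsed_cuda_grid_0", "set_layout", "quad"))`, with the whole message delivered as one string.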
+
+## Production pipeline
+
+A real CCTV-wall pipeline (several publishers, the filter, NVENC, an RTSP server, plus a ZMQ controller for layout switching) is wired together in the `localhost-infra` repo as `docker-compose.phase7.yml` (or whatever is current; check the repo). It is the freshest reference for which flags actually work together and which image tags to pin.
+
+**Do not copy a compose snippet from this page**: by the time you read this, the image tags will have drifted. Open the live file in `localhost-infra` and adapt from there.
+
+Overall shape:
+
+```text
+[ cuframes-rtsp-source × N ]   (publishers, one per camera)
+            │
+            │  cuframes://camN  (decoded NV12, shared via SCM_RIGHTS)
+            ▼
+[ ffmpeg + vf_cuda_grid ]      (composer, GPU-side)
+            │
+            │  H264 NVENC
+            ▼
+[ mediamtx ]                   (RTSP / WebRTC fan-out)
+            │
+            ▼
+[ TV / browser ]
+```
+
+A separate small controller process talks to the composer over ZMQ and switches layouts in response to user actions.
+
+## See also
+
+- [`cuframes://` demuxer](/docs/integration/ffmpeg-demuxer): how the inputs get into FFmpeg.
+- Upstream filter project, [git.goldix.org/gx/vf-cuda-grid](https://git.goldix.org/gx/vf-cuda-grid): full reference, build instructions, ZMQ protocol, overlay types.
@@ -0,0 +1,47 @@
+---
+title: Python bindings
+sidebar_position: 3
+---
+
+# Python bindings
+
+**Status: planned for v0.5+. Not shipped yet.**
+
+This page is a placeholder so that the integration story is honest about what exists and what does not. If you need Python access to cuframes today, see [Workaround for v0.4](#workaround-for-v04) below.
+
+## What v0.5+ will bring
+
+A small Python package, `cuframes`, distributed as a wheel, providing:
+
+- **`ctypes` bindings** over the existing C API (`cuframes_publisher_create`, `cuframes_subscriber_create`, `cuframes_acquire`, `cuframes_release`, and so on). No new ABI, no SWIG, no C++ wrapper.
+- **Zero-copy access from CuPy / PyTorch** to the decoded NV12 surface via the [CUDA Array Interface](https://numba.readthedocs.io/en/stable/cuda/cuda_array_interface.html). A subscribed frame is exposed as a CuPy / PyTorch tensor that points at the same GPU memory the publisher wrote: no `cudaMemcpy` to the host and no device-side copy.
+- **A context-manager API** for acquire / release, so that frame slots do not leak on exceptions.
+
+Target use cases:
+
+- **PyTorch inference on shared frames**: a detector / classifier subscribes to `cuframes://camN`, receives a tensor, and runs `model(tensor)` directly. Today this requires either re-decoding the RTSP stream or copying frames through a Unix pipe.
+- **OpenCV CUDA processing**: a `cv2.cuda_GpuMat` is constructed from the cuframes pointer, after which any cv2.cuda operation (resize, colour conversion, optical flow) runs in place.
+- **Rapid prototyping**: a Jupyter notebook that subscribes to a live camera, crops a region of interest, and visualises it without bringing up a full FFmpeg pipeline.
+
+No CPU-side NumPy fallback is planned. Frames are GPU surfaces; if you need them on the CPU, call `tensor.cpu()` explicitly and accept the copy.
+
+## Workaround for v0.4
+
+Until the bindings exist, the supported path is to call `libcuframes.so` directly from Python via `ctypes` or `cffi`. The C API is small (≈ 10 functions) and stable within a given v0.x release.
+
+Realistically, most v0.4 deployments do not call cuframes from Python at all. They use the existing C / FFmpeg path:
+
+- For inference: subscribe to the **packet** ring with `ffmpeg -f cuframes_packets -i cuframes_packets://camN`, let ffmpeg decode, and pipe the decoded frames into your Python process. Zero-copy is lost, but the pipeline works today.
+- For prototyping: take the C examples in [`examples/`](https://git.goldix.org/gx/cuframes) as a starting point.
+
+If you need the zero-copy path in Python *right now*, you will have to write the `ctypes` wrappers yourself. Mirror the prototypes from [`include/cuframes/cuframes.h`](/docs/reference/protocol), keep handles opaque (`void *`), and wrap the device pointer you get from the subscriber in CuPy `UnownedMemory` to view the imported VMM allocation. Do not `cuMemAlloc` a new buffer for this: that would be separate memory, not a view of the shared slot. Expect rough edges: the v0.5 bindings exist precisely because doing this by hand is no fun.
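
As a starting point for such hand-written wrappers, a raw device pointer can be described to CuPy or PyTorch through the CUDA Array Interface without any copy. This is a hedged sketch, not the planned v0.5 API. `DevicePlane` and `nv12_planes` are hypothetical names. The pointer and pitch are assumed to come from your own `ctypes` wrapper. Placing the UV plane directly after `height` rows of luma is a common NV12 layout that must be checked against the real slot layout.

```python
from typing import Any, Dict, Tuple


class DevicePlane:
    """Expose a raw device pointer through the CUDA Array Interface (version 3)."""

    def __init__(self, ptr: int, shape: Tuple[int, ...], strides: Tuple[int, ...],
                 readonly: bool = True) -> None:
        self._iface: Dict[str, Any] = {
            "version": 3,
            "shape": shape,
            "typestr": "|u1",          # unsigned 8-bit samples
            "strides": strides,        # in bytes; rows are pitch-aligned
            "data": (ptr, readonly),   # (device pointer, read-only flag)
        }

    @property
    def __cuda_array_interface__(self) -> Dict[str, Any]:
        return self._iface


def nv12_planes(ptr: int, width: int, height: int,
                pitch: int) -> Tuple[DevicePlane, DevicePlane]:
    """Describe the Y plane and the interleaved UV plane of one NV12 surface."""
    luma = DevicePlane(ptr, (height, width), (pitch, 1))
    # Assumption: chroma starts right after `height` rows of luma.
    chroma = DevicePlane(ptr + pitch * height,
                         (height // 2, width // 2, 2), (pitch, 2, 1))
    return luma, chroma
```

With a valid device pointer, `cupy.asarray(luma)` or `torch.as_tensor(luma, device="cuda")` should give a view of the shared surface. The plane objects do not own the memory, so the slot must stay acquired for as long as the view is in use.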
+
+## Tracking progress
+
+The roadmap and milestone notes for the Python package are in [`ROADMAP.md`](https://git.goldix.org/gx/cuframes) in the cuframes repo. v0.5 is also gated on the HEVC packet-ring path and a small ABI cleanup, so the bindings are not the only item in that release.
+
+## See also
+
+- [Installation](/docs/getting-started/install): get `libcuframes.so` onto the system so the bindings have something to load.
+- [Protocol reference](/docs/reference/protocol): the C API surface that the bindings mirror.
+- [FFmpeg `cuframes://` demuxer](/docs/integration/ffmpeg-demuxer): today's practical way to get cuframes data into any non-C consumer.
@@ -0,0 +1,59 @@
+---
+title: What is cuframes
+sidebar_position: 1
+slug: /intro
+---
+
+# cuframes
+
+**Zero-copy sharing of decoded video frames between processes over CUDA, without a shared PID namespace.**
+
+A pure C library (LGPL-2.1+) for passing decoded NV12 frames between Linux processes via CUDA VMM + POSIX file descriptors. No re-encode, no CPU-side memcpy, no Unix-pipe serialization: the consumer gets a pointer to the same GPU memory as the producer.
+
+```mermaid
+flowchart LR
+  RTSP[RTSP camera] --> Pub[Publisher<br/>NVDEC → VMM pool]
+  Pub -- POSIX FD via SCM_RIGHTS --> Sub1[Subscriber 1<br/>FFmpeg detect]
+  Pub -- POSIX FD via SCM_RIGHTS --> Sub2[Subscriber 2<br/>FFmpeg compose]
+  Pub -- POSIX FD via SCM_RIGHTS --> Sub3[Subscriber 3<br/>AI inference]
+```
+
+## Status
+
+**v0.4: early stage, but proven in production.** One real deployment in a CCTV system (4 IP cameras, NVENC re-encode chain, 25 fps). Not enterprise-ready: a single maintainer, no paid support, and the ABI broke 4 times during one month of development.
+
+A good fit if you are building your own video pipeline and understand the risks of an early-stage OSS library. Not a drop-in replacement for DeepStream / GStreamer in enterprise scenarios.
+
+## Why
+
+The typical case is one RTSP camera decoder and several consumers (NVR recording, an AI detector, a live compositor for a TV wall). Without cuframes:
+
+- **The naive path:** every consumer opens its own RTSP stream → N×NVDEC + N×NIC. With 10 cameras and 3 consumers that is 30 parallel decodes.
+- **DeepStream:** works, but comes with vendor lock-in, a heavy runtime, and licensing restrictions.
+- **Raw CUDA IPC handles:** require a PID namespace shared between the processes. With Frigate or a Docker stack on k8s this is often not an option.
+
+cuframes solves the third problem: one decoder publishes, any number of consumers subscribe over a Unix socket, FDs are passed via `SCM_RIGHTS`, and no PID sharing is needed.
+
+## What cuframes does not do
+
+- **Encoded video.** The frame ring carries decoded NV12. For an encoded H.264/H.265 packet stream there is a separate packet ring (Annex-B byte stream over POSIX shm).
+- **Cross-host.** Same machine only: a POSIX FD cannot be passed over the network.
+- **Cross-vendor.** NVIDIA only: `cuMemCreate` is a CUDA API. AMD HIP / Intel oneAPI are not supported yet.
+- **Windows.** Linux only (POSIX shm + Unix sockets + SCM_RIGHTS).
+- **HA / failover.** One producer per key: if it dies, consumers see `CUFRAMES_ERR_DISCONNECTED` and decide for themselves what to do.
+
+## Architecture in 30 seconds
+
+- **Producer** (`cuframes_publisher_create`): allocates an N×frame VMM pool (`cuMemCreate(POSIX_FILE_DESCRIPTOR)`), opens the Unix socket `/run/cuframes/<key>.sock`, and opens the POSIX shm `/dev/shm/cuframes-<key>` for the metadata header.
+- **Subscriber** (`cuframes_subscriber_create`): connects to the socket, receives N file descriptors via `SCM_RIGHTS` during the handshake, imports them with `cuMemImportFromShareableHandle`, and mmaps the shm header.
+- **Publish loop:** the producer calls `acquire()` → memcpy into the slot → `publish(stream, pts)`, which does `cuStreamSynchronize` + an atomic update of `slot.seq` and `global_seq`.
+- **Consume loop:** the subscriber polls the atomic `global_seq` and reads the frame with a DtoD copy into its own stream.
+- **Sync:** the producer's `cuStreamSynchronize` before publish guarantees hardware coherence: the consumer reads the data after an atomic load, without CUDA events (a simplification introduced in `v0.4`).
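
The publish / consume handshake can be modelled in a few lines of plain Python. This is a single-process toy model for intuition, not the real shared-memory protocol. The slot selection `seq % N`, the re-check of `slot.seq` after the copy, and all class names are assumptions. `bytes` objects stand in for GPU surfaces.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Slot:
    seq: int = 0        # sequence number of the frame currently in this slot
    pts: int = 0
    data: bytes = b""   # stand-in for the GPU surface


@dataclass
class Ring:
    slots: List[Slot]
    global_seq: int = 0  # last published sequence number; 0 means "nothing yet"

    def publish(self, data: bytes, pts: int) -> int:
        seq = self.global_seq + 1
        slot = self.slots[seq % len(self.slots)]
        slot.data, slot.pts = data, pts  # real library: copy, then cuStreamSynchronize
        slot.seq = seq                   # per-slot sequence first...
        self.global_seq = seq            # ...then the counter that subscribers poll
        return seq


class Subscriber:
    def __init__(self, ring: Ring) -> None:
        self.ring = ring
        self.last_seen = ring.global_seq

    def poll(self) -> Optional[Tuple[int, int, bytes]]:
        seq = self.ring.global_seq
        if seq == self.last_seen:
            return None                  # nothing new
        slot = self.ring.slots[seq % len(self.ring.slots)]
        frame = (seq, slot.pts, slot.data)  # stands in for the DtoD copy
        if slot.seq != seq:
            return None                  # slot was reused during the read; retry later
        self.last_seen = seq
        return frame
```

In this model a slow subscriber simply skips to the newest frame; whether the real library offers other delivery policies is outside the scope of the sketch.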
+
+Full specification: [Protocol reference](/docs/reference/protocol).
+
+## Next
+
+- [Installation](/docs/getting-started/install): Docker or from source (apt packages are planned for v1.0).
+- [First publisher](/docs/getting-started/first-publisher): a C example in 30 lines.
+- [FFmpeg demuxer integration](/docs/integration/ffmpeg-demuxer): the `cuframes://key` URL scheme.
@@ -0,0 +1,7 @@
+{
+  "label": "Reference",
+  "position": 5,
+  "collapsible": true,
+  "collapsed": true,
+  "link": null
+}
@@ -0,0 +1,51 @@
+{
+  "name": "cuframes-site",
+  "version": "0.1.0",
+  "private": true,
+  "scripts": {
+    "docusaurus": "docusaurus",
+    "start": "docusaurus start",
+    "build": "docusaurus build",
+    "swizzle": "docusaurus swizzle",
+    "deploy": "docusaurus deploy",
+    "clear": "docusaurus clear",
+    "serve": "docusaurus serve",
+    "write-translations": "docusaurus write-translations",
+    "write-heading-ids": "docusaurus write-heading-ids",
+    "typecheck": "tsc"
+  },
+  "dependencies": {
+    "@docusaurus/core": "^3.10.1",
+    "@docusaurus/faster": "^3.10.1",
+    "@docusaurus/preset-classic": "^3.10.1",
+    "@docusaurus/theme-mermaid": "^3.10.1",
+    "@easyops-cn/docusaurus-search-local": "^0.55.1",
+    "@mdx-js/react": "^3.0.0",
+    "clsx": "^2.0.0",
+    "prism-react-renderer": "^2.3.0",
+    "react": "^19.0.0",
+    "react-dom": "^19.0.0"
+  },
+  "devDependencies": {
+    "@docusaurus/module-type-aliases": "^3.10.1",
+    "@docusaurus/tsconfig": "^3.10.1",
+    "@docusaurus/types": "^3.10.1",
+    "@types/react": "^19.0.0",
+    "typescript": "~6.0.2"
+  },
+  "browserslist": {
+    "production": [
+      ">0.5%",
+      "not dead",
+      "not op_mini all"
+    ],
+    "development": [
+      "last 3 chrome version",
+      "last 3 firefox version",
+      "last 5 safari version"
+    ]
+  },
+  "engines": {
+    "node": ">=20.0"
+  }
+}
@@ -0,0 +1,16 @@
+import type { SidebarsConfig } from '@docusaurus/plugin-content-docs';
+
+// The sidebar is auto-generated from the `docs/` directory structure.
+// Order of categories and pages is controlled by:
+//   - `sidebar_position` in the page frontmatter;
+//   - `_category_.json` in the folder (label, position, collapsed, collapsible).
+const sidebars: SidebarsConfig = {
+  mainSidebar: [
+    {
+      type: 'autogenerated',
+      dirName: '.',
+    },
+  ],
+};
+
+export default sidebars;
@@ -0,0 +1,30 @@
+/**
+ * Global styles and overrides for Infima theme variables.
+ * The dagstack brand palette is built around the gradient logo
+ * `#4A49DB → #3E38C9` (see site/static/img/logo.svg).
+ */
+
+:root {
+  /* Primary — indigo from the top of the gradient logo. */
+  --ifm-color-primary: #4A49DB;
+  --ifm-color-primary-dark: #3E38C9;
+  --ifm-color-primary-darker: #3530B5;
+  --ifm-color-primary-darkest: #282495;
+  --ifm-color-primary-light: #6261E1;
+  --ifm-color-primary-lighter: #7574E5;
+  --ifm-color-primary-lightest: #9696EB;
+  --ifm-code-font-size: 95%;
+  --docusaurus-highlighted-code-line-bg: rgba(74, 73, 219, 0.08);
+}
+
+/* For the dark theme — lighter shades for readability on a dark background. */
+[data-theme='dark'] {
+  --ifm-color-primary: #8988E9;
+  --ifm-color-primary-dark: #7574E5;
+  --ifm-color-primary-darker: #6261E1;
+  --ifm-color-primary-darkest: #4A49DB;
+  --ifm-color-primary-light: #9E9DEE;
+  --ifm-color-primary-lighter: #BFBEF4;
+  --ifm-color-primary-lightest: #D6D5F8;
+  --docusaurus-highlighted-code-line-bg: rgba(137, 136, 233, 0.12);
+}
@@ -0,0 +1,12 @@
+// This file is not used by "docusaurus start/build" commands.
+// It is here to improve your IDE experience (type-checking, autocompletion...),
+// and can also run the package.json "typecheck" script manually.
+{
+  "extends": "@docusaurus/tsconfig",
+  "compilerOptions": {
+    "baseUrl": ".",
+    "ignoreDeprecations": "6.0",
+    "strict": true
+  },
+  "exclude": [".docusaurus", "build"]
+}