Merge pull request 'Python bindings (pybind11) — Phase 0 v1' (#7 ) from feat/python-bindings into main

python: DLPack + health stats + CUDA stream + docs (tasks #199-#202)
#199 DLPack export: - frame.dlpack_y() / .dlpack_uv() — explicit multi-plane access для NV12 - frame.__dlpack__() / __dlpack_device__() — protocol для torch/cupy - Capsule deleter правильно держит refcount на frame_keep_alive, releases shape/strides arrays. CUDA pointer принадлежит frame. #200 Health/stats counters: - frames_received, timeouts, errors — per-call counters - last_seq, gap_count — proxy для drop count (NEWEST_ONLY mode) - last_frame_pts_ns - stats() — snapshot dict для MQTT health publish - counted в pybind layer т.к. C API не expose'ит ring_occupancy #201 Per-subscriber CUDA stream + thread-safety: - consumer_stream kwarg в subscribe() — int (cudaStream_t pointer) - subscriber.consumer_stream property - Thread-safety contract в docstring CuframesSubscriber - next_frame() передаёт consumer_stream_ в cuframes_subscriber_next #202 Smoke test + docs: - 10/10 pytest passed (расширен +2 теста на consumer_stream) - docs/python.md (~250 строк): quick start, API reference, integration с PyTorch/CuPy, reconnect-loop pattern, per-stream usage, pitch alignment, thread-safety, error taxonomy, backpressure, Phase 0 limitations Verify build + tests: cmake -B build-python -DBUILD_PYTHON_BINDINGS=ON cmake --build build-python -j pytest python/tests/ -v # 10/10 Закрывает Phase 0 issue gx/cuframes#6. Разблокирует goldix-smart-home/yolo-world-detector Phase 1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-13 21:34:29 +01:00 · 2026-06-13 21:33:21 +01:00 · 2026-06-13 21:23:42 +01:00 · 2026-06-13 21:19:03 +01:00 · 2026-06-13 12:59:04 +01:00 · 2026-06-03 04:27:24 +01:00
28 changed files with 2515 additions and 311 deletions
@@ -117,3 +117,95 @@ cd build && cmake -DBUILD_TESTING=ON ..  && cmake --build . && ctest -R stress -
 Production деplo замеры — см. интеграционные guides:
 - [docs/integration.md](docs/integration.md) — cctv-processor C++ pipeline
 - [filter/README.md](filter/README.md) — FFmpeg demuxer (Frigate setup)
+
+---
+
+## Real-world production deployment (2026-05-19, v0.2.0)
+
+**Setup**: 4 Dahua IP-камеры (HEVC main 1920×1080 / 2688×1520, 25 fps) → 3
+одновременных consumer'а на одном RTX 5090 хосте:
+- **Frigate** detect (ONNX D-FINE-S, 640×480) + record (full-res H.265 mp4)
+- **cctv-backend** custom C++ mosaic processor (composes 4×grid → RTSP output для TV)
+
+### Before → after (measured production, идентичный workload)
+
+| Метрика | Без cuframes | С cuframes v0.2 dual-input | Reduction |
+|---|---:|---:|---:|
+| **RTSP connections к камерам** | 12 (4 cam × 3 consumer) | **4** (publishers only) | **−67%** |
+| **NVDEC sessions** | ~8 (decode на каждый consumer) | **4** (publishers only) | **−50%** |
+| **Camera-side bandwidth** | ~34 Mbps (main+main+sub per cam) | **~16 Mbps** (main per cam) | **−54%** |
+| **PCIe D2H copies (consumer side)** | ~346 MB/s (decoded frames → host) | **~0** (zero-copy CUDA IPC) | **−100%** |
+| **Frigate ffmpeg с прямым RTSP** | 8 (detect+record × 4) | **0** (all через cuframes) | **−100%** |
+
+### Live nvidia-smi metrics в running system
+
+```
+GPU SM:     4-5%   (compute: detector + cuframes consumers)
+GPU NVDEC:  2-4%   (без cuframes ожидаемо было 15-25%)
+GPU NVENC:  0-1%
+```
+
+### VRAM breakdown (measured)
+
+| Component | VRAM |
+|---|---:|
+| 4× cuframes publishers (3× FHD ring + 1× 2688×1520 для LPR) | **4.4 GB** |
+| cctv-backend (composer + grid output) | 1.0 GB |
+| frigate.embeddings_manager (face + LPR ONNX models) | 1.6 GB |
+| frigate.detector:onnx (D-FINE-S COCO) | 0.6 GB |
+| **Total cuframes-stack VRAM** | **~7.7 GB** |
+
+Из них на сам cuframes accounting — только **4.4 GB** в publishers (ring buffers +
+NVDEC decode buffers). Consumers (Frigate, cctv-backend) держат свои CUDA
+contexts независимо.
+
+### Network bandwidth (real tcpdump, 10-sec sample)
+
+**31.5 Mbps** от camera subnet (4 cameras → R9), измерено через
+`tcpdump -w cam-traffic.pcap` за 10 секунд.
+
+Breakdown approximate:
+- 4 publishers × main HEVC RTP/UDP: **~16 Mbps** (cuframes core)
+- go2rtc on-demand streams (Frigate UI live preview, если открыт): **0-10 Mbps**
+- ONVIF discovery, RTSP keepalives, NTP-from-cameras: **~1-2 Mbps**
+
+Без cuframes тот же setup (cctv-backend + Frigate detect + Frigate record × 4
+camera) дал бы **~45-50 Mbps** (главное: record path забирал отдельный
+main stream от каждой camera).
+
+### Camera-side benefits
+
+Dahua/Hikvision камеры обычно cap'нуты на 4-5 одновременных RTSP streams.
+До cuframes setup (4 cam × 3 RTSP) делал каждую camera на **60-75% capacity**
+её RTSP server'а. После — **20-25%**, headroom на 2-3 дополнительных
+consumer'а без замены оборудования.
+
+### Что **сохранено** (важно)
+
+- **Качество записи**: record path через `cuframes_packets://` это **passthrough**
+  (`-c:v copy`), bit-exact original encoded stream от камеры. Frigate пишет mp4
+  с full-resolution оригинала, без re-encode.
+- **Latency**: <2 ms publisher → consumer (cuframes IPC) vs ~50-80 ms RTSP setup
+  latency для каждого нового consumer.
+- **Backward compatibility**: v0.2 publishers принимают v1 subscribers
+  (frames-only), rolling upgrade.
+
+### Hardware-agnostic projection (для другого setup)
+
+| If you have | Expected reduction |
+|---|---|
+| 16 cameras × 2 consumers | 32 → 16 NVDEC (−50%), 32 → 16 RTSP (−50%) |
+| 8 cameras × 3 consumers | 24 → 8 NVDEC (−67%), 24 → 8 RTSP (−67%) |
+| 4 cameras × 4 consumers (multi-AI pipeline) | 16 → 4 NVDEC (−75%), 16 → 4 RTSP (−75%) |
+
+Reduction масштабируется **линейно** с N (consumers per camera). v0.1 (frames
+only) сэкономит NVDEC; v0.2 (frames + packets) **дополнительно** сэкономит
+RTSP connections для record/mux consumers.
+
+### Что **НЕ** сэкономлено (честно)
+
+- **Disk space**: запись остаётся full-resolution H.265 mp4. Cuframes не сжимает.
+- **Detector inference latency**: ONNX/TensorRT detector работает на decoded
+  frames независимо от source. Cuframes только меняет где decode произошёл.
+- **Camera RTSP server CPU**: сама камера всё равно encode'ит видео. Cuframes
+  reduces **consumer-side** load, не producer-side.
@@ -1,7 +1,7 @@
 cmake_minimum_required(VERSION 3.20)
 project(cuframes
-    VERSION 0.2.0
-    DESCRIPTION "Zero-copy frame sharing via CUDA IPC"
+    VERSION 0.4.0
+    DESCRIPTION "Zero-copy frame sharing via CUDA VMM + POSIX FD"
    LANGUAGES C CXX CUDA
 )

@@ -39,3 +39,7 @@ endif()
 if(BUILD_TOOLS)
    add_subdirectory(tools/cuframes-rtsp-source)
 endif()
+
+if(BUILD_PYTHON_BINDINGS)
+    add_subdirectory(python)
+endif()
@@ -75,27 +75,19 @@ ETA: 1-2 недели focused работы.

 Open questions: какой memory-type — `memory:CUDAMemory` (mainline) vs `memory:NVMM` (NVIDIA DeepStream-specific). Возможно два варианта/build flags.

-### `vf_cuda_grid` — FFmpeg filter с runtime grid composition
+### `vf_cuda_grid` — **выделен в отдельный продукт `gx/vf-cuda-grid`** ([repo](https://git.goldix.org/gx/vf-cuda-grid))

-CCTV mosaic composition как FFmpeg filter, **полностью на GPU**. Заменяет custom C++ GridComposer (см. [gx/cctv#22](https://git.goldix.org/gx/cctv/issues/22) — performance investigation cctv-processor: CPU round-trip pipeline).
+FFmpeg filter для GPU-native video grid composition + control-plane sidecar
+(ZeroMQ/MQTT/HTTP/HA Discovery). Дизайн зафиксирован, см.
+[`gx/vf-cuda-grid` docs/design.md](https://git.goldix.org/gx/vf-cuda-grid/src/branch/main/docs/design.md)
+и [epic issue #1](https://git.goldix.org/gx/vf-cuda-grid/issues/1).

-| Capability | Зачем |
-|---|---|
-| Filter принимает N cuda-frames (через `[in0][in1][in2]...` filter inputs) | Композиция в одном filter graph без custom code |
-| Output — один cuda-frame с N cells в layout | Прямой вход в `hwdownload` или `h264_nvenc` |
-| Layout templates (`single`, `quad`, `main_plus_preview`, `nine_grid`, ...) | Конфигурируемые из CLI или filter command'ом |
-| `sendcmd` / API для runtime smena layout'а | Не нужно teardown filter graph для переключения сетки |
-| Per-cell overlays (text, bbox) через side data в AVFrame | Frigate detection/LPR/face — overlay внутри pipeline |
-| Полностью CUDA-side: scale/composition/text rendering | Zero CPU round-trip, frame не покидает VRAM |
+Cuframes остаётся frame source provider для vf-cuda-grid в нашей экосистеме
+(но vf-cuda-grid работает и с любым другим CUDA frame source — стандартный FFmpeg).

-Это превращает cuframes из IPC-библиотеки в полноценную **GPU-native video routing platform**. Эстетически близко к NVIDIA DeepStream `nvstreammux` + `nvmultistreamtiler`, но open-source и с conventional FFmpeg-stack.
-
-Open questions:
- Filter input mode: pull-based (filter pull'ает N inputs) или push-based (через external lock-step). FFmpeg filter API больше pull-friendly.
- Text rendering в CUDA — `vf_drawtext` имеет CPU path; нужен либо GPU font-renderer (Pango/freetype + texture upload), либо CPU-precomputed glyph atlases.
- Runtime layout commands через filter `process_command` API.
-
-Это **большой scope** — отдельная major version (v0.5+) или standalone проект.
+Закрывает [`gx/cctv#22`](https://git.goldix.org/gx/cctv/issues/22) Phase 4
+(end-to-end GPU pipeline для cctv-processor mosaic composer) после Phase 4 vf-cuda-grid +
+миграция cctv-processor GridComposer → vf_cuda_grid filter.

 ## v1.0 — Stable ABI 📋

@@ -16,7 +16,8 @@
 #     /usr/local/bin/cuframes-rtsp-source --rtsp ... --key ...

 # ─── Build stage ─────────────────────────────────────────────────────────
-FROM nvidia/cuda:13.0.3-cudnn-devel-ubuntu24.04 AS build
+# CUDA 12.4 — matching ffmpeg-vf-cuda-grid base + Frigate stable-tensorrt
+FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS build

 ENV DEBIAN_FRONTEND=noninteractive
 RUN apt-get update && apt-get install -y --no-install-recommends \
@@ -36,12 +37,13 @@ RUN cmake -B build -S . -G Ninja \
    && cmake --build build --parallel

 # ─── Runtime stage ────────────────────────────────────────────────────────
-FROM nvidia/cuda:13.0.3-cudnn-runtime-ubuntu24.04 AS runtime
+FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS runtime

 ENV DEBIAN_FRONTEND=noninteractive
 RUN apt-get update && apt-get install -y --no-install-recommends \
-        libavcodec60 libavformat60 libavutil58 \
+        libavcodec58 libavformat58 libavutil56 \
        ca-certificates \
+        mosquitto-clients \
    && rm -rf /var/lib/apt/lists/*

 # libcuframes.so → /usr/local/lib (стандартный путь для ldconfig)
@@ -0,0 +1,284 @@
+# cuframes Python bindings
+
+Status: **v0.4 — Phase 0 alpha** (issue [gx/cuframes#6](http://server:3000/gx/cuframes/issues/6))
+
+Python пакет `cuframes` — pybind11-обёртка над C ABI libcuframes. Цель —
+позволить downstream ML/CV пайплайнам (yolo-world-detector, zone-motion,
+custom скриптам) подписываться на cuframes **без CPU round-trip**: получать
+NV12 frames прямо как CUDA pointer / `torch.Tensor` (DLPack export, zero-copy
+из VRAM publisher'а в VRAM consumer'а).
+
+## Установка
+
+Standalone wheel (рекомендуемый):
+
+```bash
+cd cuframes/python/
+pip install -e . --no-build-isolation
+```
+
+Через корневой CMake:
+
+```bash
+cmake -B build -DBUILD_PYTHON_BINDINGS=ON
+cmake --build build -j
+```
+
+## Quick start
+
+```python
+import cuframes
+
+print(cuframes.version_string())  # "0.4.0"
+
+with cuframes.subscribe("cam-parking",
+                        consumer_name="yolo-world",
+                        connect_timeout_ms=5000) as sub:
+    with sub.next_frame(timeout_ms=1000) as frame:
+        print(f"{frame.width}x{frame.height} "
+              f"format={frame.format} seq={frame.seq}")
+```
+
+## API
+
+### `cuframes.subscribe(key, ...)`
+
+Создать подписку на publisher. Возвращает `CuframesSubscriber`.
+
+| Параметр | Тип | Default | Назначение |
+|---|---|---|---|
+| `key` | `str` | (required) | Имя publisher'а (`"cam-parking"` и т.п.) |
+| `consumer_name` | `str \| None` | `None` (auto-generated) | Идентификатор подписки |
+| `mode` | `SubscriberMode` | `NEWEST_ONLY` | `NEWEST_ONLY` skip'ит промежуточные frames, `STRICT_ORDER` — все по порядку |
+| `cuda_device` | `int` | `0` | CUDA device id |
+| `connect_timeout_ms` | `int` | `-1` (бесконечно) | Сколько ждать publisher'а |
+| `consumer_stream` | `int` | `0` (default stream) | `cudaStream_t` как pointer |
+
+### `CuframesSubscriber`
+
+Контекст-менеджер. Methods/properties:
+
+```python
+sub.next_frame(timeout_ms=-1)  # → CuframesFrame
+sub.close()                     # idempotent
+
+# read-only properties
+sub.key                # str
+sub.consumer_name      # str
+sub.mode               # SubscriberMode
+sub.cuda_device        # int
+sub.consumer_stream    # int (cudaStream_t ptr)
+sub.closed             # bool
+
+# health / stats (Phase 0 counters)
+sub.frames_received    # int
+sub.timeouts           # int
+sub.errors             # int
+sub.last_seq           # int (sequence number последнего frame'а)
+sub.gap_count          # int (proxy для drop count в NEWEST_ONLY)
+sub.last_frame_pts_ns  # int
+sub.stats()            # dict — snapshot всех counters для MQTT publish
+```
+
+### `CuframesFrame`
+
+Контекст-менеджер. Properties (read-only):
+
+```python
+frame.cuda_ptr   # int (uintptr_t)
+frame.format     # PixelFormat
+frame.width      # int
+frame.height     # int
+frame.pitch_y    # int — pitch Y plane (важно — может быть > width!)
+frame.pitch_uv   # int
+frame.seq        # int — sequence number у publisher'а
+frame.pts_ns     # int — CLOCK_MONOTONIC у publisher'а
+frame.released   # bool
+
+# DLPack export (zero-copy)
+frame.dlpack_y()       # capsule — Y plane как 2D uint8 GPU tensor
+frame.dlpack_uv()      # capsule — UV plane (только NV12)
+frame.__dlpack__()     # protocol для torch.from_dlpack(frame)
+frame.__dlpack_device__()  # (kDLCUDA=2, device_id)
+```
+
+## Интеграция с PyTorch
+
+```python
+import torch
+import cuframes
+
+with cuframes.subscribe("cam-parking", connect_timeout_ms=5000) as sub:
+    with sub.next_frame() as frame:
+        # Single-plane (default — Y plane для NV12)
+        y_tensor = torch.from_dlpack(frame)
+        # Multi-plane explicit
+        y = torch.from_dlpack(frame.dlpack_y())   # shape=[H, W] uint8
+        uv = torch.from_dlpack(frame.dlpack_uv()) # shape=[H/2, W] uint8
+
+        # Y plane уже в VRAM — никаких copy. Можно сразу feed в NN.
+        y_float = y.float() / 255.0  # будет на CUDA device
+```
+
+## Интеграция с CuPy
+
+```python
+import cupy
+import cuframes
+
+with cuframes.subscribe("cam-parking", connect_timeout_ms=5000) as sub:
+    with sub.next_frame() as frame:
+        y_array = cupy.from_dlpack(frame.dlpack_y())  # cupy.ndarray на GPU
+```
+
+## Pattern: reconnect-loop для долгоживущего consumer'а
+
+```python
+import time
+import cuframes
+
+def consume_camera(key: str, on_frame):
+    while True:
+        try:
+            with cuframes.subscribe(key, connect_timeout_ms=5000) as sub:
+                while True:
+                    try:
+                        with sub.next_frame(timeout_ms=1000) as frame:
+                            on_frame(frame)
+                    except cuframes.CuframesFrameTimeout:
+                        # просто нет новых кадров — продолжаем ждать
+                        continue
+        except cuframes.CuframesPublisherGone:
+            # publisher умер / перезапускается — переподписываемся
+            print(f"publisher {key} gone, reconnect через 1s")
+            time.sleep(1)
+        except cuframes.CuframesError as e:
+            # фатальная ошибка — логируем и продолжаем
+            print(f"error: {e!r}")
+            time.sleep(5)
+```
+
+## Per-subscriber CUDA stream
+
+В продакшене на 4+ камеры каждый subscriber должен иметь свой stream —
+иначе `cudaStreamWaitEvent` сериализует всех consumer'ов через default
+stream.
+
+С `cuda-python`:
+
+```python
+from cuda import cudart
+import cuframes
+
+err, stream = cudart.cudaStreamCreate()
+assert err == cudart.cudaError_t.cudaSuccess
+
+with cuframes.subscribe("cam-parking", consumer_stream=int(stream)) as sub:
+    ...
+```
+
+С `torch.cuda.Stream`:
+
+```python
+import torch
+import cuframes
+
+stream = torch.cuda.Stream()
+with cuframes.subscribe("cam-parking",
+                        consumer_stream=stream.cuda_stream) as sub:
+    with torch.cuda.stream(stream):
+        with sub.next_frame() as frame:
+            tensor = torch.from_dlpack(frame)
+            # ... inference на этом stream'е ...
+```
+
+## Pitch alignment — важно!
+
+NVDEC отдаёт NV12 с pitch alignment 256 байт. Для камер с шириной не
+кратной 256 (`gate_lpr 2688×1520` → pitch 2688 OK; но представьте `640×480`
+→ pitch обычно 640 байт, но **может быть 768**).
+
+```python
+# WRONG — assume pitch == width
+y = torch.frombuffer(...)  # данные смещены
+
+# RIGHT — использовать DLPack который сам respect'ит strides
+y = torch.from_dlpack(frame.dlpack_y())  # stride учтён правильно
+
+# ALTERNATIVELY — manual через cuda-python с правильным pitch
+ptr = frame.cuda_ptr
+pitch = frame.pitch_y
+height = frame.height
+```
+
+## Thread-safety contract
+
+- Каждый `CuframesSubscriber` принадлежит **одному Python потоку**.
+  Создание и все вызовы (`next_frame`, `close`) — в одном thread.
+- Несколько subscriber'ов в разных потоках — **OK** (каждому свой handle,
+  свой CUDA stream).
+- `CuframesFrame` тоже принадлежит одному потоку — после `release()` его
+  CUDA pointer становится недействительным, доступ из другого потока —
+  undefined behavior.
+- Внутренний GIL отпускается на блокирующих вызовах (`subscriber_create`,
+  `next_frame`) — другие Python потоки могут выполняться.
+
+Для multi-camera в одном процессе используйте `asyncio` или `threading`:
+
+```python
+import threading
+import cuframes
+
+def worker(camera_key):
+    with cuframes.subscribe(camera_key, connect_timeout_ms=5000) as sub:
+        # subscribe в этом же потоке
+        while True:
+            with sub.next_frame(timeout_ms=1000) as frame:
+                process(frame)
+
+for key in ["cam-parking", "cam-front_yard", "cam-gate_lpr", "cam-back_yard"]:
+    threading.Thread(target=worker, args=(key,), daemon=True).start()
+```
+
+## Error taxonomy
+
+Все exception'ы наследуются от `CuframesError`. Конкретные subclass'ы
+позволяют разную обработку:
+
+| Exception | Когда выбрасывается | Что делать |
+|---|---|---|
+| `CuframesPublisherGone` | publisher умер или ещё не стартовал | reconnect-loop |
+| `CuframesFrameTimeout` | timeout без frame'а | продолжать ждать или log'нуть |
+| `CuframesDeviceLost` | CUDA error на cross-process sync | abort, не recoverable |
+| `CuframesShmError` | socket/mmap/IPC error | log, abort или восстановить |
+| `CuframesProtocolMismatch` | версия libcuframes несовместима | пересобрать |
+| `CuframesInvalidArgument` | bug в caller | fix code |
+| `CuframesOutOfMemory` | cudaMalloc fail | reduce работу |
+| `CuframesInternal` | bug в libcuframes | report |
+
+## Backpressure
+
+`next_frame()` blocking call с GIL released. Если consumer медленнее
+publisher'а:
+- В `NEWEST_ONLY` mode (default) — publisher продолжает писать, consumer
+  получает **самый свежий** frame (промежуточные пропускает). `gap_count`
+  растёт.
+- В `STRICT_ORDER` mode — при ring overflow `CuframesPublisherGone` →
+  reconnect.
+
+Frame удерживать долго **нельзя**: в `STRICT_WAIT` policy publisher
+заблокирует ring. Pattern — забрать DLPack, инициировать GPU работу,
+release frame сразу.
+
+## Текущие ограничения (Phase 0)
+
+- Publisher API не обёрнут (только subscriber-side)
+- Packet ring (encoded video) не обёрнут
+- Async callback API не обёрнут
+- `ring_occupancy` / реальный drop count — нет в C API (counted в pybind как
+  `gap_count`, это proxy)
+- Smoke test реального subscribe требует Docker IPC namespace (cuframes
+  socket/SHM живут в namespace publisher'а)
+
+Эти ограничения снимаются по мере необходимости — issues в
+[gx/cuframes](http://server:3000/gx/cuframes).
@@ -36,7 +36,7 @@ extern "C" {
 /* ─────────────────────────────────────────────────────────────────────── */

 #define CUFRAMES_VERSION_MAJOR 0
-#define CUFRAMES_VERSION_MINOR 2
+#define CUFRAMES_VERSION_MINOR 4
 #define CUFRAMES_VERSION_PATCH 0

 /** @brief Runtime-версия библиотеки в формате "MAJOR.MINOR.PATCH". */
@@ -19,7 +19,7 @@ add_library(cuframes_static STATIC ${CUFRAMES_SOURCES})
 foreach(target cuframes cuframes_static)
    target_include_directories(${target}
        PUBLIC
-            $<BUILD_INTERFACE:${CMAKE_SOURCE_DIR}/include>
+            $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/include>
            $<INSTALL_INTERFACE:include>
        PRIVATE
            ${CMAKE_CURRENT_SOURCE_DIR}/src
@@ -34,6 +34,7 @@ foreach(target cuframes cuframes_static)
    target_link_libraries(${target}
        PUBLIC
            CUDA::cudart
+            CUDA::cuda_driver   # v0.4 — cuMemCreate/cuMemMap/cuMemExportToShareableHandle
            Threads::Threads
            rt   # для shm_open
    )
@@ -41,7 +42,7 @@ endforeach()

 # Set SOVERSION на shared lib для ABI tracking
 set_target_properties(cuframes PROPERTIES
-    VERSION 0.2.0
+    VERSION 0.4.0
    SOVERSION 0
 )

@@ -1,4 +1,13 @@
-/* Subscriber implementation (sync). */
+/* Subscriber implementation (sync).
+ *
+ * v0.4 — VMM + POSIX FD. Принимает FDs через SCM_RIGHTS в handshake,
+ * импортирует через cuMemImportFromShareableHandle + cuMemMap. Не требует
+ * shared pid/ipc namespace с producer'ом.
+ *
+ * Sync: producer cuStreamSynchronize'ит свой stream перед atomic_store(seq).
+ * Consumer просто читает seq (acquire) и копирует данные через DtoD memcpy —
+ * никаких cudaEventWait не нужно (HW coherence на одном GPU).
+ */

 #include "internal.h"
 #include <errno.h>
@@ -21,14 +30,13 @@ struct cuframes_frame {
    int64_t pts_ns;

    uint32_t slot_idx;
-    void *subscriber;        /* back-ref для release() */
+    void *subscriber;
 };

-/* Opaque packet handle — single-packet pattern (как frame_obj). */
 struct cuframes_packet {
-    uint8_t *data;          /* heap buffer, allocated by subscriber на enable_packets */
-    size_t   capacity;      /* size of allocation */
-    size_t   size;          /* actual payload size */
+    uint8_t *data;
+    size_t   capacity;
+    size_t   size;
    int64_t  pts_ns;
    int64_t  dts_ns;
    uint32_t flags;
@@ -44,24 +52,31 @@ struct cuframes_subscriber {
    cuframes_shm_header_t *hdr;
    char shm_name[80];

-    cudaEvent_t producer_event;
-    void *mapped_ptrs[CUFRAMES_MAX_RING];
+    /* v0.4 — VMM imported slots */
+    CUmemGenericAllocationHandle vmm_handles[CUFRAMES_MAX_RING];
+    CUdeviceptr                  vmm_ptrs[CUFRAMES_MAX_RING];
+    size_t                       vmm_slot_size;
+    int                          imported_count;

    uint32_t assigned_bit;
    uint64_t last_seen_seq;

-    /* Frame pool — переиспользуем одну frame_t structure (single-thread API). */
    struct cuframes_frame frame_obj;
    int frame_busy;

-    /* v0.2 — packet ring (optional, opened via enable_packets). */
    int                 has_pkt_ring;
    cuframes_pkt_ring_t pkt_ring;
-    uint64_t            last_packet_seq;     /* UINT64_MAX = no packet read yet */
+    uint64_t            last_packet_seq;
    struct cuframes_packet packet_obj;
    int                 packet_busy;
 };

+static const char *cu_err_str(CUresult r) {
+    const char *s = NULL;
+    cuGetErrorString(r, &s);
+    return s ? s : "?";
+}
+
 /* ─── Frame accessors ────────────────────────────────────────────────── */
 void *cuframes_frame_cuda_ptr(const cuframes_frame_t *f) { return f ? f->cuda_ptr : NULL; }
 cuframes_format_t cuframes_frame_format(const cuframes_frame_t *f) { return f ? f->format : 0; }
@@ -77,11 +92,13 @@ int64_t cuframes_frame_pts_ns(const cuframes_frame_t *f) { return f ? f->pts_ns

 /* ─── Subscriber create ──────────────────────────────────────────────── */

-static int do_handshake(struct cuframes_subscriber *sub, const char *name) {
-    /* Send HELLO_REQ */
+static int do_handshake(struct cuframes_subscriber *sub, const char *name,
+                         int *fds_out, uint32_t *fd_count_inout,
+                         uint64_t *slot_size_out) {
+    /* Send HELLO_REQ — proto v4 */
    uint8_t buf[CUFRAMES_MAX_MSG_PAYLOAD];
    cuframes_msg_hello_req_t *hreq = (cuframes_msg_hello_req_t *)buf;
-    hreq->proto_version = CUFRAMES_PROTOCOL_V1;
+    hreq->proto_version = CUFRAMES_PROTOCOL_V4;
    uint32_t nl = name ? (uint32_t)strlen(name) : 0;
    if (nl > 31) nl = 31;
    hreq->consumer_name_len = nl;
@@ -98,7 +115,6 @@ static int do_handshake(struct cuframes_subscriber *sub, const char *name) {
                                         buf, plen);
    if (r != CUFRAMES_OK) return r;

-    /* Recv HELLO_RESP */
    uint32_t rmt = 0, rpl = sizeof(buf);
    r = cuframes_internal_recv_msg(sub->sock_fd, &rmt, buf, &rpl, 5000);
    if (r != CUFRAMES_OK) return r;
@@ -106,10 +122,15 @@ static int do_handshake(struct cuframes_subscriber *sub, const char *name) {

    cuframes_msg_hello_resp_t *hresp = (cuframes_msg_hello_resp_t *)buf;
    if (hresp->result != CUFRAMES_OK) return hresp->result;
+    if (hresp->proto_version_actual != CUFRAMES_PROTOCOL_V4) {
+        CUFRAMES_LOG_ERROR("publisher proto v%u — нужен v%u (v0.4)",
+                            hresp->proto_version_actual, CUFRAMES_PROTOCOL_V4);
+        return CUFRAMES_ERR_PROTOCOL;
+    }

    /* Send SUBSCRIBE_REQ */
    uint32_t srbuf[8];
-    srbuf[0] = CUFRAMES_PROTOCOL_V1;
+    srbuf[0] = CUFRAMES_PROTOCOL_V4;
    memset(srbuf + 1, 0, 28);
    r = cuframes_internal_send_msg(sub->sock_fd, CUFRAMES_MSG_SUBSCRIBE_REQ,
                                     srbuf, sizeof(srbuf));
@@ -124,7 +145,29 @@ static int do_handshake(struct cuframes_subscriber *sub, const char *name) {
    if (sresp.result != CUFRAMES_OK) return sresp.result;

    sub->assigned_bit = sresp.assigned_bit;
-    sub->last_seen_seq = sresp.initial_seq;  /* start от текущей точки */
+    sub->last_seen_seq = sresp.initial_seq;
+
+    /* Recv VMM_FDS */
+    cuframes_msg_vmm_fds_t vmm_payload = {0};
+    uint32_t vmm_plen = sizeof(vmm_payload);
+    rmt = 0;
+    r = cuframes_internal_recv_msg_with_fds(sub->sock_fd, &rmt,
+                                              &vmm_payload, &vmm_plen,
+                                              fds_out, fd_count_inout, 5000);
+    if (r != CUFRAMES_OK) {
+        CUFRAMES_LOG_ERROR("recv VMM_FDS: %s", cuframes_strerror(r));
+        return r;
+    }
+    if (rmt != CUFRAMES_MSG_VMM_FDS) {
+        CUFRAMES_LOG_ERROR("expected VMM_FDS got 0x%x", rmt);
+        return CUFRAMES_ERR_PROTOCOL;
+    }
+    if (vmm_payload.fd_count != *fd_count_inout) {
+        CUFRAMES_LOG_ERROR("VMM_FDS: payload fd_count=%u, received %u",
+                            vmm_payload.fd_count, *fd_count_inout);
+        return CUFRAMES_ERR_PROTOCOL;
+    }
+    *slot_size_out = vmm_payload.slot_size_bytes;
    return CUFRAMES_OK;
 }

@@ -141,7 +184,6 @@ int cuframes_subscriber_create(const cuframes_subscriber_config_t *cfg,
    sub->sock_fd = -1;
    sub->shm_fd = -1;

-    /* Generate fallback name if NULL */
    char name_buf[32];
    const char *name = cfg->consumer_name;
    if (!name) {
@@ -150,12 +192,10 @@ int cuframes_subscriber_create(const cuframes_subscriber_config_t *cfg,
        name = name_buf;
    }

-    /* Build paths */
    char sock_path[128];
    int r = cuframes_internal_socket_path(cfg->key, sock_path, sizeof(sock_path));
    if (r != CUFRAMES_OK) { free(sub); return r; }

-    /* Connect with timeout retry */
    int64_t deadline = cfg->connect_timeout_ms > 0
        ? cuframes_now_ns() + (int64_t)cfg->connect_timeout_ms * 1000000LL
        : 0;
@@ -170,63 +210,117 @@ int cuframes_subscriber_create(const cuframes_subscriber_config_t *cfg,
        sub->sock_fd = -1;
        if (cfg->connect_timeout_ms == 0) { r = CUFRAMES_ERR_NOT_FOUND; goto fail; }
        if (deadline && cuframes_now_ns() > deadline) { r = CUFRAMES_ERR_TIMEOUT; goto fail; }
-        struct timespec ts = {.tv_sec = 0, .tv_nsec = 100000000}; /* 100ms */
+        struct timespec ts = {.tv_sec = 0, .tv_nsec = 100000000};
        nanosleep(&ts, NULL);
    }

-    /* Handshake */
-    r = do_handshake(sub, name);
+    /* Handshake (включая VMM_FDS) */
+    int fds[CUFRAMES_MAX_RING];
+    for (int i = 0; i < CUFRAMES_MAX_RING; i++) fds[i] = -1;
+    uint32_t fd_count = CUFRAMES_MAX_RING;
+    uint64_t slot_size = 0;
+    r = do_handshake(sub, name, fds, &fd_count, &slot_size);
    if (r != CUFRAMES_OK) goto fail;

-    /* Open SHM */
+    /* Open SHM (для seq atomics + meta) */
    r = cuframes_internal_shm_name(cfg->key, sub->shm_name, sizeof(sub->shm_name));
-    if (r != CUFRAMES_OK) goto fail;
+    if (r != CUFRAMES_OK) goto fail_close_fds;
    sub->shm_fd = shm_open(sub->shm_name, O_RDWR, 0);
    if (sub->shm_fd < 0) {
        CUFRAMES_LOG_ERROR("shm_open %s: %s", sub->shm_name, strerror(errno));
-        r = CUFRAMES_ERR_IO; goto fail;
+        r = CUFRAMES_ERR_IO; goto fail_close_fds;
    }
    sub->hdr = mmap(NULL, sizeof(cuframes_shm_header_t),
                     PROT_READ | PROT_WRITE, MAP_SHARED, sub->shm_fd, 0);
    if (sub->hdr == MAP_FAILED) {
        sub->hdr = NULL;
-        r = CUFRAMES_ERR_IO; goto fail;
+        r = CUFRAMES_ERR_IO; goto fail_close_fds;
+    }
+    if (sub->hdr->magic != CUFRAMES_MAGIC) {
+        if (sub->hdr->magic == CUFRAMES_MAGIC_LEGACY) {
+            CUFRAMES_LOG_ERROR("publisher uses legacy v0.1-v0.3 SHM — нужен v0.4 publisher");
+        } else {
+            CUFRAMES_LOG_ERROR("SHM magic mismatch: 0x%x", sub->hdr->magic);
+        }
+        r = CUFRAMES_ERR_PROTOCOL; goto fail_close_fds;
+    }
+    if (sub->hdr->proto_version != CUFRAMES_PROTOCOL_V4) {
+        CUFRAMES_LOG_ERROR("SHM proto v%u — нужен v%u",
+                            sub->hdr->proto_version, CUFRAMES_PROTOCOL_V4);
+        r = CUFRAMES_ERR_PROTOCOL; goto fail_close_fds;
    }
-    if (sub->hdr->magic != CUFRAMES_MAGIC) { r = CUFRAMES_ERR_PROTOCOL; goto fail; }

-    /* CUDA setup */
+    /* CUDA driver init + import VMM handles */
+    CUresult cr = cuInit(0);
+    if (cr != CUDA_SUCCESS) {
+        CUFRAMES_LOG_ERROR("cuInit: %s", cu_err_str(cr));
+        r = CUFRAMES_ERR_CUDA; goto fail_close_fds;
+    }
+    /* Ensure a runtime context exists (cudaMemcpyAsync from this pool needs it) */
    cudaError_t cerr = cudaSetDevice(sub->cfg.cuda_device);
    if (cerr != cudaSuccess) {
        CUFRAMES_LOG_ERROR("cudaSetDevice: %s", cudaGetErrorString(cerr));
-        r = CUFRAMES_ERR_CUDA; goto fail;
+        r = CUFRAMES_ERR_CUDA; goto fail_close_fds;
    }

-    /* Open producer's event */
-    cerr = cudaIpcOpenEventHandle(&sub->producer_event, sub->hdr->ipc_event_handle);
-    if (cerr != cudaSuccess) {
-        CUFRAMES_LOG_ERROR("cudaIpcOpenEventHandle: %s", cudaGetErrorString(cerr));
-        r = CUFRAMES_ERR_CUDA; goto fail;
-    }
+    CUmemAccessDesc access = {0};
+    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
+    access.location.id = sub->cfg.cuda_device;
+    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

-    /* Open mem handles */
-    int ring = (int)sub->hdr->ring_size;
-    if (ring > CUFRAMES_MAX_RING) ring = CUFRAMES_MAX_RING;
-    for (int i = 0; i < ring; ++i) {
-        cerr = cudaIpcOpenMemHandle(&sub->mapped_ptrs[i],
-                                       sub->hdr->slots[i].mem_handle,
-                                       cudaIpcMemLazyEnablePeerAccess);
-        if (cerr != cudaSuccess) {
-            CUFRAMES_LOG_ERROR("cudaIpcOpenMemHandle slot %d: %s",
-                                i, cudaGetErrorString(cerr));
-            r = CUFRAMES_ERR_CUDA; goto fail;
+    sub->vmm_slot_size = (size_t)slot_size;
+    sub->imported_count = 0;
+    for (uint32_t i = 0; i < fd_count; ++i) {
+        cr = cuMemImportFromShareableHandle(&sub->vmm_handles[i],
+                                             (void *)(uintptr_t)fds[i],
+                                             CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR);
+        if (cr != CUDA_SUCCESS) {
+            CUFRAMES_LOG_ERROR("cuMemImportFromShareableHandle slot %u: %s",
+                                i, cu_err_str(cr));
+            r = CUFRAMES_ERR_CUDA; goto fail_unmap;
        }
+        /* После import можно закрыть FD — kernel держит reference через handle */
+        close(fds[i]);
+        fds[i] = -1;
+
+        cr = cuMemAddressReserve(&sub->vmm_ptrs[i], sub->vmm_slot_size, 0, 0, 0);
+        if (cr != CUDA_SUCCESS) {
+            CUFRAMES_LOG_ERROR("cuMemAddressReserve slot %u: %s",
+                                i, cu_err_str(cr));
+            r = CUFRAMES_ERR_CUDA; goto fail_unmap;
+        }
+        cr = cuMemMap(sub->vmm_ptrs[i], sub->vmm_slot_size, 0,
+                       sub->vmm_handles[i], 0);
+        if (cr != CUDA_SUCCESS) {
+            CUFRAMES_LOG_ERROR("cuMemMap slot %u: %s", i, cu_err_str(cr));
+            r = CUFRAMES_ERR_CUDA; goto fail_unmap;
+        }
+        cr = cuMemSetAccess(sub->vmm_ptrs[i], sub->vmm_slot_size, &access, 1);
+        if (cr != CUDA_SUCCESS) {
+            CUFRAMES_LOG_ERROR("cuMemSetAccess slot %u: %s", i, cu_err_str(cr));
+            r = CUFRAMES_ERR_CUDA; goto fail_unmap;
+        }
+        sub->imported_count++;
    }

-    CUFRAMES_LOG_INFO("subscriber '%s' connected to '%s' (bit=%u, ring=%d)",
-                       name, sub->key, sub->assigned_bit, ring);
+    CUFRAMES_LOG_INFO("subscriber '%s' connected to '%s' (bit=%u, ring=%u, v0.4 VMM)",
+                       name, sub->key, sub->assigned_bit, fd_count);
    *out = sub;
    return CUFRAMES_OK;

+fail_unmap:
+    /* Cleanup partial VMM */
+    for (int i = 0; i < sub->imported_count; i++) {
+        if (sub->vmm_ptrs[i]) {
+            cuMemUnmap(sub->vmm_ptrs[i], sub->vmm_slot_size);
+            cuMemAddressFree(sub->vmm_ptrs[i], sub->vmm_slot_size);
+        }
+        if (sub->vmm_handles[i]) cuMemRelease(sub->vmm_handles[i]);
+    }
+fail_close_fds:
+    for (int i = 0; i < CUFRAMES_MAX_RING; i++) {
+        if (fds[i] >= 0) close(fds[i]);
+    }
 fail:
    cuframes_subscriber_destroy(sub);
    return r;
@@ -242,6 +336,7 @@ int cuframes_subscriber_next(cuframes_subscriber_t *sub,
                              memory_order_acquire) != 0) {
        return CUFRAMES_ERR_DISCONNECTED;
    }
+    (void)consumer_stream;  /* v0.4: producer уже StreamSync'нул, sync не нужен */

    int64_t deadline = (timeout_ms > 0)
        ? cuframes_now_ns() + (int64_t)timeout_ms * 1000000LL
@@ -255,11 +350,9 @@ int cuframes_subscriber_next(cuframes_subscriber_t *sub,
            if (sub->cfg.mode == CUFRAMES_MODE_NEWEST_ONLY) {
                target_seq = gs;
            } else {
-                /* STRICT_ORDER */
                if (sub->last_seen_seq == UINT64_MAX) {
                    target_seq = gs;
                } else if (gs > sub->last_seen_seq + (uint64_t)sub->hdr->ring_size) {
-                    /* Producer overran us. */
                    return CUFRAMES_ERR_DISCONNECTED;
                } else {
                    target_seq = sub->last_seen_seq + 1;
@@ -269,30 +362,22 @@ int cuframes_subscriber_next(cuframes_subscriber_t *sub,
            uint64_t slot_seq = atomic_load_explicit(&sub->hdr->slots[slot_idx].seq,
                                                     memory_order_acquire);
            if (slot_seq != target_seq) {
-                /* Slot уже перезаписан producer'ом — пересчитать */
                continue;
            }
            int64_t pts = atomic_load_explicit(&sub->hdr->slots[slot_idx].pts_ns,
                                                memory_order_acquire);

-            /* Cross-process sync: wait event on consumer's stream */
-            if (consumer_stream) {
-                cudaError_t cerr = cudaStreamWaitEvent((cudaStream_t)consumer_stream,
-                                                         sub->producer_event, 0);
-                if (cerr != cudaSuccess) {
-                    CUFRAMES_LOG_WARN("cudaStreamWaitEvent: %s",
-                                       cudaGetErrorString(cerr));
-                    return CUFRAMES_ERR_CUDA;
-                }
-            } else {
-                /* Synchronize globally — для cudaMemcpyDeviceToHost users */
-                cudaError_t cerr = cudaEventSynchronize(sub->producer_event);
-                if (cerr != cudaSuccess) return CUFRAMES_ERR_CUDA;
+            /* v0.4: producer уже cuStreamSynchronize'нул перед atomic_store seq.
+             * Данные физически в GPU memory к моменту acquire fence. Post-sync
+             * verify оставляем — defending against ring wrap pока мы читали pts. */
+            uint64_t verify_seq = atomic_load_explicit(&sub->hdr->slots[slot_idx].seq,
+                                                       memory_order_acquire);
+            if (verify_seq != target_seq) {
+                continue;
            }

-            /* Fill frame_out */
            struct cuframes_frame *f = &sub->frame_obj;
-            f->cuda_ptr = sub->mapped_ptrs[slot_idx];
+            f->cuda_ptr = (void *)(uintptr_t)sub->vmm_ptrs[slot_idx];
            f->format = (cuframes_format_t)sub->hdr->meta.format;
            f->width = sub->hdr->meta.width;
            f->height = sub->hdr->meta.height;
@@ -308,12 +393,9 @@ int cuframes_subscriber_next(cuframes_subscriber_t *sub,
            return CUFRAMES_OK;
        }

-        /* Не было frame'ов */
        if (timeout_ms == 0) return CUFRAMES_ERR_WOULD_BLOCK;
        if (timeout_ms > 0 && cuframes_now_ns() > deadline) return CUFRAMES_ERR_TIMEOUT;

-        /* Poll-based wait (eventfd — v0.2). 50µs interval — компромисс
-         * latency vs CPU. */
        struct timespec ts = {.tv_sec = 0, .tv_nsec = 50000};
        nanosleep(&ts, NULL);

@@ -329,7 +411,6 @@ int cuframes_subscriber_release(cuframes_subscriber_t *sub,
    if (!frame) return CUFRAMES_OK;
    if (!sub || frame->subscriber != sub) return CUFRAMES_ERR_INVALID_ARG;

-    /* ACK через bitmap */
    if (sub->assigned_bit > 0 && sub->assigned_bit < 64) {
        atomic_fetch_or_explicit(&sub->hdr->slots[frame->slot_idx].ack_bitmap,
                                  1ULL << sub->assigned_bit,
@@ -348,7 +429,6 @@ int cuframes_subscriber_release(cuframes_subscriber_t *sub,
 int cuframes_subscriber_destroy(cuframes_subscriber_t *sub) {
    if (!sub) return CUFRAMES_OK;

-    /* Clear subscriber bit */
    if (sub->hdr && sub->assigned_bit > 0) {
        atomic_fetch_and_explicit(&sub->hdr->subscriber_bitmap,
                                   ~(1ULL << sub->assigned_bit),
@@ -357,15 +437,15 @@ int cuframes_subscriber_destroy(cuframes_subscriber_t *sub) {
                              0, memory_order_release);
    }

-    if (sub->producer_event) cudaEventDestroy(sub->producer_event);
-
-    int ring = sub->hdr ? (int)sub->hdr->ring_size : 0;
-    if (ring > CUFRAMES_MAX_RING) ring = CUFRAMES_MAX_RING;
-    for (int i = 0; i < ring; ++i) {
-        if (sub->mapped_ptrs[i]) cudaIpcCloseMemHandle(sub->mapped_ptrs[i]);
+    /* VMM cleanup */
+    for (int i = 0; i < sub->imported_count; i++) {
+        if (sub->vmm_ptrs[i]) {
+            cuMemUnmap(sub->vmm_ptrs[i], sub->vmm_slot_size);
+            cuMemAddressFree(sub->vmm_ptrs[i], sub->vmm_slot_size);
+        }
+        if (sub->vmm_handles[i]) cuMemRelease(sub->vmm_handles[i]);
    }

-    /* Packet ring cleanup */
    if (sub->has_pkt_ring) {
        cuframes_internal_pkt_ring_destroy(&sub->pkt_ring);
    }
@@ -385,7 +465,6 @@ int cuframes_subscriber_destroy(cuframes_subscriber_t *sub) {
 /* v0.2 — encoded packet ring API (см. docs/protocol.md §10)                */
 /* ─────────────────────────────────────────────────────────────────────── */

-/* Packet accessors */
 const void *cuframes_packet_data(const cuframes_packet_t *p)  { return p ? p->data : NULL; }
 size_t      cuframes_packet_size(const cuframes_packet_t *p)  { return p ? p->size : 0; }
 int64_t     cuframes_packet_pts(const cuframes_packet_t *p)   { return p ? p->pts_ns : 0; }
@@ -395,7 +474,7 @@ uint64_t    cuframes_packet_seq(const cuframes_packet_t *p)   { return p ? p->se

 int cuframes_subscriber_enable_packets(cuframes_subscriber_t *sub) {
    if (!sub) return CUFRAMES_ERR_INVALID_ARG;
-    if (sub->has_pkt_ring) return CUFRAMES_OK; /* idempotent */
+    if (sub->has_pkt_ring) return CUFRAMES_OK;

    char pkt_name[128];
    int r = cuframes_internal_pkt_shm_name(sub->key, pkt_name, sizeof(pkt_name));
@@ -404,8 +483,6 @@ int cuframes_subscriber_enable_packets(cuframes_subscriber_t *sub) {
    r = cuframes_internal_pkt_ring_open(pkt_name, &sub->pkt_ring);
    if (r != CUFRAMES_OK) return r;

-    /* Allocate copy-buffer (max packet size). Используем data_size как
-     * conservative upper bound (publisher гарантирует data_size >= max_packet_size). */
    size_t capacity = sub->pkt_ring.hdr->data_size;
    sub->packet_obj.data = (uint8_t *)malloc(capacity);
    if (!sub->packet_obj.data) {
@@ -414,7 +491,6 @@ int cuframes_subscriber_enable_packets(cuframes_subscriber_t *sub) {
    }
    sub->packet_obj.capacity = capacity;

-    /* Start с last_keyframe_seq - 1 → первый read даст IDR (§10.14). */
    uint64_t kf = atomic_load_explicit(&sub->pkt_ring.hdr->last_keyframe_seq,
                                        memory_order_acquire);
    sub->last_packet_seq = (kf == UINT64_MAX) ? UINT64_MAX : kf - 1;
@@ -427,7 +503,7 @@ int cuframes_subscriber_next_packet(cuframes_subscriber_t *sub,
                                     int32_t timeout_ms) {
    if (!sub || !pkt_out) return CUFRAMES_ERR_INVALID_ARG;
    if (!sub->has_pkt_ring) return CUFRAMES_ERR_NO_PACKET_RING;
-    if (sub->packet_busy) return CUFRAMES_ERR_INVALID_ARG; /* previous packet not released */
+    if (sub->packet_busy) return CUFRAMES_ERR_INVALID_ARG;

    int64_t deadline_ns = (timeout_ms > 0) ?
        cuframes_now_ns() + (int64_t)timeout_ms * 1000000LL : 0;
@@ -456,28 +532,25 @@ int cuframes_subscriber_next_packet(cuframes_subscriber_t *sub,
        }

        if (r == CUFRAMES_ERR_PACKET_OVERRUN) {
-            /* Resync — установить last_seq = last_keyframe_seq - 1, повторить. */
            uint64_t kf = atomic_load_explicit(
                &sub->pkt_ring.hdr->last_keyframe_seq, memory_order_acquire);
            if (kf != UINT64_MAX) {
                sub->last_packet_seq = kf - 1;
            }
-            /* Возвращаем OVERRUN наружу — caller знает что был discontinuity. */
            *pkt_out = NULL;
            return CUFRAMES_ERR_PACKET_OVERRUN;
        }

        if (r != CUFRAMES_ERR_TIMEOUT) {
            *pkt_out = NULL;
-            return r; /* DISCONNECTED, INVALID_ARG, etc. */
+            return r;
        }

-        /* TIMEOUT branch — poll/sleep */
        if (timeout_ms == 0) return CUFRAMES_ERR_WOULD_BLOCK;
        if (timeout_ms > 0 && cuframes_now_ns() >= deadline_ns) {
            return CUFRAMES_ERR_TIMEOUT;
        }
-        struct timespec ts = {0, 1 * 1000 * 1000}; /* 1 ms poll interval */
+        struct timespec ts = {0, 1 * 1000 * 1000};
        nanosleep(&ts, NULL);
    }
 }
@@ -485,7 +558,7 @@ int cuframes_subscriber_next_packet(cuframes_subscriber_t *sub,
 int cuframes_subscriber_release_packet(cuframes_subscriber_t *sub,
                                        cuframes_packet_t *pkt) {
    if (!sub) return CUFRAMES_ERR_INVALID_ARG;
-    if (!pkt) return CUFRAMES_OK; /* NULL-safe */
+    if (!pkt) return CUFRAMES_OK;
    if (pkt != &sub->packet_obj) return CUFRAMES_ERR_INVALID_ARG;
    sub->packet_busy = 0;
    return CUFRAMES_OK;
@@ -499,7 +572,6 @@ int cuframes_subscriber_get_codec_params(cuframes_subscriber_t *sub,
    if (!sub->has_pkt_ring) return CUFRAMES_ERR_NO_PACKET_RING;
    cuframes_pkt_header_t *hdr = sub->pkt_ring.hdr;
    if (codec_id_out)       *codec_id_out = hdr->codec_id;
-    /* Если extradata ещё не выставлен publisher'ом — size=0, pointer ok но empty. */
    if (extradata_out)      *extradata_out = hdr->codec_extradata;
    if (extradata_size_out) *extradata_size_out = hdr->codec_extradata_size;
    if (hdr->codec_extradata_size == 0) return CUFRAMES_ERR_NO_CODEC_PARAMS;
@@ -8,6 +8,7 @@
 #define CUFRAMES_INTERNAL_H

 #define _GNU_SOURCE
+#include <cuda.h>          /* v0.4 — driver API: cuMemCreate/cuMemMap/cuMemExportToShareableHandle */
 #include <cuda_runtime.h>
 #include <pthread.h>
 #include <stdatomic.h>
@@ -21,9 +22,12 @@

 /* ─── Protocol constants ──────────────────────────────────────────────── */

-#define CUFRAMES_MAGIC          0xCC7C1DCCu
+#define CUFRAMES_MAGIC          0xCC7C1DCEu         /* v0.4 — bumped с 0xCC7C1DCC (full ABI break) */
+#define CUFRAMES_MAGIC_LEGACY   0xCC7C1DCCu         /* v0.1—v0.3 magic; ловится consumer'ом как clean PROTOCOL error */
 #define CUFRAMES_PROTOCOL_V1    1u
 #define CUFRAMES_PROTOCOL_V2    2u                  /* v0.2 — packet ring support */
+#define CUFRAMES_PROTOCOL_V3    3u                  /* v0.3 — per-slot CUDA events (deprecated; не работает без pid share) */
+#define CUFRAMES_PROTOCOL_V4    4u                  /* v0.4 — VMM + POSIX FD: pid/ipc namespace share не требуется */
 #define CUFRAMES_MAX_SUBSCRIBERS 32
 #define CUFRAMES_MAX_RING       16
 #define CUFRAMES_MAX_KEY_LEN    63
@@ -107,6 +111,11 @@ typedef struct __attribute__((packed)) cuframes_shm_header {
    /* offset 0x100 — variable-length tail */
    cuframes_shm_slot_t slots[CUFRAMES_MAX_RING];                /* 192 × 16 = 3072 */
    cuframes_shm_subscriber_t subscribers[CUFRAMES_MAX_SUBSCRIBERS]; /* 128 × 32 = 4096 */
+    /* v0.3 — per-slot CUDA event handles. Producer records event per publish;
+     * consumer waits event[slot_idx] specifically (не global ipc_event_handle
+     * который signals только для последнего published frame). Закрывает TOCTOU
+     * race в slot read. 64 × 16 = 1024 bytes. */
+    cudaIpcEventHandle_t slot_event_handles[CUFRAMES_MAX_RING];
 } cuframes_shm_header_t;

 /* Layout sanity checks (docs/protocol.md §2 table) */
@@ -198,6 +207,10 @@ typedef struct cuframes_pkt_ring {
 #define CUFRAMES_MSG_PING           0xF0
 #define CUFRAMES_MSG_PONG           0xF1
 #define CUFRAMES_MSG_ERROR          0xFE
+/* v0.4: после SUBSCRIBE_RESP publisher шлёт VMM_FDS с N posix FD handles в
+ * SCM_RIGHTS control. Payload: uint64_t slot_size + uint32_t fd_count +
+ * uint32_t reserved (для alignment). FDs приходят отдельным контрол-блоком. */
+#define CUFRAMES_MSG_VMM_FDS        0x05

 #define CUFRAMES_MAX_MSG_PAYLOAD 4096

@@ -231,6 +244,14 @@ typedef struct __attribute__((packed)) cuframes_msg_subscribe_resp {
    uint8_t  reserved[12];
 } cuframes_msg_subscribe_resp_t;

+/* v0.4: payload VMM_FDS message. Сами FDs идут в SCM_RIGHTS control-msg
+ * (см. cuframes_internal_send_msg_with_fds). */
+typedef struct __attribute__((packed)) cuframes_msg_vmm_fds {
+    uint64_t slot_size_bytes;   /* физический размер одного slot после round-up к granularity */
+    uint32_t fd_count;          /* должно совпадать с ring_size */
+    uint32_t reserved;
+} cuframes_msg_vmm_fds_t;
+
 /* ─── Logging (minimal — to stderr) ────────────────────────────────────── */

 #define CUFRAMES_LOG_ERROR(fmt, ...) \
@@ -266,6 +287,15 @@ int cuframes_internal_recv_msg(int sock_fd, uint32_t *msg_type_out,
                                void *payload, uint32_t *payload_len_inout,
                                int32_t timeout_ms);

+/* v0.4 — send/recv с FD-attached. Используется только для VMM_FDS message. */
+int cuframes_internal_send_msg_with_fds(int sock_fd, uint32_t msg_type,
+                                         const void *payload, uint32_t payload_len,
+                                         const int *fds, uint32_t fd_count);
+int cuframes_internal_recv_msg_with_fds(int sock_fd, uint32_t *msg_type_out,
+                                         void *payload, uint32_t *payload_len_inout,
+                                         int *fds_out, uint32_t *fd_count_inout,
+                                         int32_t timeout_ms);
+
 /* ─── Packet ring helpers (libcuframes/src/packet_ring.c) ─────────────── */

 /* Publisher: create SHM + initialize header + slots. Stale recovery как у frames. */
@@ -1,4 +1,14 @@
-/* Publisher implementation (docs/protocol.md §1, §2, §3.2, §4.2, §5). */
+/* Publisher implementation (docs/protocol.md §1, §2, §3.2, §4.2, §5).
+ *
+ * v0.4 — VMM + POSIX FD. Заменяет cudaMalloc+cudaIpcGetMemHandle на
+ * cuMemCreate + cuMemExportToShareableHandle(POSIX_FILE_DESCRIPTOR). FDs
+ * передаются consumer'у через SCM_RIGHTS, не нужны shared pid/ipc namespace.
+ *
+ * Sync (вместо cudaEventRecord+cudaIpcEventHandle): cuStreamSynchronize в
+ * do_publish — producer ждёт ~ms что stream flush'нулся, потом publishes seq.
+ * Consumer читает данные через DtoD копию без event wait — HW coherence
+ * гарантирована на одном GPU.
+ */

 #include "internal.h"
 #include <errno.h>
@@ -20,10 +30,18 @@ struct cuframes_publisher {
    char                socket_path[128];
    char                shm_name[80];

-    /* CUDA */
-    cudaEvent_t         event;
-    cudaIpcMemHandle_t  ipc_mem[CUFRAMES_MAX_RING];
-    void               *cuda_ptrs[CUFRAMES_MAX_RING];  /* mapped pointers */
+    /* v0.4 — VMM-allocated pool. Каждый slot: cuMemCreate → cuMemAddressReserve
+     * → cuMemMap → cuMemSetAccess. FD экспортируется один раз и передаётся всем
+     * subscribers через SCM_RIGHTS. */
+    CUmemGenericAllocationHandle vmm_handles[CUFRAMES_MAX_RING];
+    CUdeviceptr                  vmm_ptrs[CUFRAMES_MAX_RING];
+    int                          vmm_fds[CUFRAMES_MAX_RING];
+    size_t                       vmm_slot_size;   /* rounded к granularity */
+    int                          has_vmm_pool;
+
+    /* CUDA stream sync — заменяет per-slot events. Producer перед каждым publish
+     * вызывает cuStreamSynchronize чтобы гарантировать что previous writes
+     * завершены (data visible для consumer'ов на любом GPU stream). */
    size_t              frame_size_bytes;
    int32_t             ring_size_actual;

@@ -32,10 +50,6 @@ struct cuframes_publisher {
    int32_t             current_slot;  /* индекс slot'а полученного через acquire() */
    int                 has_acquired;

-    /* EXTERNAL ownership: map user pointer → ring index */
-    void               *external_ptrs[CUFRAMES_MAX_RING];
-    int32_t             external_count;
-
    /* Subscriber-management thread */
    pthread_t           accept_thread;
    int                 accept_thread_alive;
@@ -51,8 +65,16 @@ struct cuframes_publisher {
 /* Forward decls */
 static void *accept_thread_main(void *arg);
 static int handshake_subscriber(struct cuframes_publisher *pub, int client_fd);
+static void free_vmm_pool(struct cuframes_publisher *pub);

-/* ─── Internal: alloc/setup CUDA pool and SHM ─────────────────────────── */
+/* Helper: format CUresult error для CUFRAMES_LOG_ERROR */
+static const char *cu_err_str(CUresult r) {
+    const char *s = NULL;
+    cuGetErrorString(r, &s);
+    return s ? s : "?";
+}
+
+/* ─── Internal: alloc VMM pool + export POSIX FDs ─────────────────────── */

 static int alloc_library_pool(struct cuframes_publisher *pub) {
    int r = cuframes_internal_calc_size(pub->cfg.format,
@@ -61,7 +83,37 @@ static int alloc_library_pool(struct cuframes_publisher *pub) {
    if (r != CUFRAMES_OK) return r;

    pub->ring_size_actual = pub->cfg.ring_size;
+    for (int i = 0; i < CUFRAMES_MAX_RING; i++) pub->vmm_fds[i] = -1;

+    /* Initialize CUDA driver API context */
+    CUresult cr = cuInit(0);
+    if (cr != CUDA_SUCCESS) {
+        CUFRAMES_LOG_ERROR("cuInit: %s", cu_err_str(cr));
+        return CUFRAMES_ERR_CUDA;
+    }
+
+    /* Pick allocation prop: pinned device memory с POSIX FD handle */
+    CUmemAllocationProp prop = {0};
+    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
+    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
+    prop.location.id = pub->cfg.cuda_device;
+    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
+
+    /* Round slot size up to granularity */
+    size_t granularity = 0;
+    cr = cuMemGetAllocationGranularity(&granularity, &prop,
+                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM);
+    if (cr != CUDA_SUCCESS) {
+        CUFRAMES_LOG_ERROR("cuMemGetAllocationGranularity: %s", cu_err_str(cr));
+        return CUFRAMES_ERR_CUDA;
+    }
+    pub->vmm_slot_size = ((pub->frame_size_bytes + granularity - 1) / granularity)
+                         * granularity;
+    CUFRAMES_LOG_INFO("VMM granularity=%zu frame=%zu slot=%zu",
+                       granularity, pub->frame_size_bytes, pub->vmm_slot_size);
+
+    /* Required: also need a runtime API context so that cudaMemcpyAsync from
+     * user works on this allocation. cudaSetDevice достаточно. */
    cudaError_t cerr = cudaSetDevice(pub->cfg.cuda_device);
    if (cerr != cudaSuccess) {
        CUFRAMES_LOG_ERROR("cudaSetDevice(%d): %s",
@@ -69,59 +121,68 @@ static int alloc_library_pool(struct cuframes_publisher *pub) {
        return CUFRAMES_ERR_CUDA;
    }

+    CUmemAccessDesc access = {0};
+    access.location = prop.location;
+    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
+
    for (int i = 0; i < pub->ring_size_actual; ++i) {
-        cerr = cudaMalloc(&pub->cuda_ptrs[i], pub->frame_size_bytes);
-        if (cerr != cudaSuccess) {
-            CUFRAMES_LOG_ERROR("cudaMalloc slot %d: %s",
-                                i, cudaGetErrorString(cerr));
+        cr = cuMemCreate(&pub->vmm_handles[i], pub->vmm_slot_size, &prop, 0);
+        if (cr != CUDA_SUCCESS) {
+            CUFRAMES_LOG_ERROR("cuMemCreate slot %d: %s", i, cu_err_str(cr));
+            free_vmm_pool(pub);
            return CUFRAMES_ERR_CUDA;
        }
-        cerr = cudaIpcGetMemHandle(&pub->ipc_mem[i], pub->cuda_ptrs[i]);
-        if (cerr != cudaSuccess) {
-            CUFRAMES_LOG_ERROR("cudaIpcGetMemHandle slot %d: %s",
-                                i, cudaGetErrorString(cerr));
+        cr = cuMemAddressReserve(&pub->vmm_ptrs[i], pub->vmm_slot_size, 0, 0, 0);
+        if (cr != CUDA_SUCCESS) {
+            CUFRAMES_LOG_ERROR("cuMemAddressReserve slot %d: %s", i, cu_err_str(cr));
+            free_vmm_pool(pub);
+            return CUFRAMES_ERR_CUDA;
+        }
+        cr = cuMemMap(pub->vmm_ptrs[i], pub->vmm_slot_size, 0,
+                       pub->vmm_handles[i], 0);
+        if (cr != CUDA_SUCCESS) {
+            CUFRAMES_LOG_ERROR("cuMemMap slot %d: %s", i, cu_err_str(cr));
+            free_vmm_pool(pub);
+            return CUFRAMES_ERR_CUDA;
+        }
+        cr = cuMemSetAccess(pub->vmm_ptrs[i], pub->vmm_slot_size, &access, 1);
+        if (cr != CUDA_SUCCESS) {
+            CUFRAMES_LOG_ERROR("cuMemSetAccess slot %d: %s", i, cu_err_str(cr));
+            free_vmm_pool(pub);
+            return CUFRAMES_ERR_CUDA;
+        }
+        /* Export POSIX FD — будет shared с consumers через SCM_RIGHTS */
+        cr = cuMemExportToShareableHandle((void *)&pub->vmm_fds[i],
+                                           pub->vmm_handles[i],
+                                           CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0);
+        if (cr != CUDA_SUCCESS) {
+            CUFRAMES_LOG_ERROR("cuMemExportToShareableHandle slot %d: %s",
+                                i, cu_err_str(cr));
+            free_vmm_pool(pub);
            return CUFRAMES_ERR_CUDA;
        }
    }
+    pub->has_vmm_pool = 1;
    return CUFRAMES_OK;
 }

-static int register_external_pool(struct cuframes_publisher *pub,
-                                   void *const *ptrs, int32_t count,
-                                   size_t frame_size) {
-    if (count < 1 || count > CUFRAMES_MAX_RING) return CUFRAMES_ERR_INVALID_ARG;
-    pub->frame_size_bytes = frame_size;
-    pub->ring_size_actual = count;
-    pub->external_count = count;
-
-    cudaError_t cerr = cudaSetDevice(pub->cfg.cuda_device);
-    if (cerr != cudaSuccess) {
-        CUFRAMES_LOG_ERROR("cudaSetDevice: %s", cudaGetErrorString(cerr));
-        return CUFRAMES_ERR_CUDA;
-    }
-    for (int i = 0; i < count; ++i) {
-        if (!ptrs[i]) return CUFRAMES_ERR_INVALID_ARG;
-        pub->cuda_ptrs[i] = ptrs[i];
-        pub->external_ptrs[i] = ptrs[i];
-        cerr = cudaIpcGetMemHandle(&pub->ipc_mem[i], ptrs[i]);
-        if (cerr != cudaSuccess) {
-            CUFRAMES_LOG_ERROR("cudaIpcGetMemHandle on external ptr %p: %s",
-                                ptrs[i], cudaGetErrorString(cerr));
-            return CUFRAMES_ERR_CUDA;
+static void free_vmm_pool(struct cuframes_publisher *pub) {
+    for (int i = 0; i < CUFRAMES_MAX_RING; i++) {
+        if (pub->vmm_fds[i] >= 0) {
+            close(pub->vmm_fds[i]);
+            pub->vmm_fds[i] = -1;
+        }
+        if (pub->vmm_ptrs[i]) {
+            cuMemUnmap(pub->vmm_ptrs[i], pub->vmm_slot_size);
+            cuMemAddressFree(pub->vmm_ptrs[i], pub->vmm_slot_size);
+            pub->vmm_ptrs[i] = 0;
+        }
+        if (pub->vmm_handles[i]) {
+            cuMemRelease(pub->vmm_handles[i]);
+            pub->vmm_handles[i] = 0;
        }
    }
-    return CUFRAMES_OK;
-}
-
-static int create_event_handle(struct cuframes_publisher *pub) {
-    cudaError_t cerr = cudaEventCreateWithFlags(&pub->event,
-        cudaEventDisableTiming | cudaEventInterprocess);
-    if (cerr != cudaSuccess) {
-        CUFRAMES_LOG_ERROR("cudaEventCreateWithFlags: %s",
-                            cudaGetErrorString(cerr));
-        return CUFRAMES_ERR_CUDA;
-    }
-    return CUFRAMES_OK;
+    pub->has_vmm_pool = 0;
 }

 static int setup_shm(struct cuframes_publisher *pub) {
@@ -139,7 +200,8 @@ static int setup_shm(struct cuframes_publisher *pub) {
                cuframes_shm_header_t tmp;
                ssize_t rb = read(existing, &tmp, sizeof(tmp));
                close(existing);
-                if (rb == (ssize_t)sizeof(tmp) && tmp.magic == CUFRAMES_MAGIC) {
+                if (rb == (ssize_t)sizeof(tmp) &&
+                    (tmp.magic == CUFRAMES_MAGIC || tmp.magic == CUFRAMES_MAGIC_LEGACY)) {
                    if (cuframes_internal_pid_alive((pid_t)tmp.producer_pid)) {
                        CUFRAMES_LOG_ERROR("publisher with key=%s already running (pid %lu)",
                            pub->key, (unsigned long)tmp.producer_pid);
@@ -172,7 +234,7 @@ static int setup_shm(struct cuframes_publisher *pub) {
    memset(pub->hdr, 0, sizeof(cuframes_shm_header_t));

    pub->hdr->magic = CUFRAMES_MAGIC;
-    pub->hdr->proto_version = CUFRAMES_PROTOCOL_V1;
+    pub->hdr->proto_version = CUFRAMES_PROTOCOL_V4;
    pub->hdr->lib_version_major = CUFRAMES_VERSION_MAJOR;
    pub->hdr->lib_version_minor = CUFRAMES_VERSION_MINOR;
    pub->hdr->lib_version_patch = CUFRAMES_VERSION_PATCH;
@@ -192,16 +254,11 @@ static int setup_shm(struct cuframes_publisher *pub) {
    pub->hdr->meta.pitch_uv = puv;
    pub->hdr->meta.frame_size_bytes = pub->frame_size_bytes;

-    /* Export event handle */
-    cudaError_t cerr = cudaIpcGetEventHandle(&pub->hdr->ipc_event_handle, pub->event);
-    if (cerr != cudaSuccess) {
-        CUFRAMES_LOG_ERROR("cudaIpcGetEventHandle: %s", cudaGetErrorString(cerr));
-        return CUFRAMES_ERR_CUDA;
-    }
-
-    /* Fill slot descriptors */
+    /* v0.4: legacy event fields в header не используются (cuStreamSynchronize
+     * заменяет IPC events). Memzero выше — достаточно. */
+    /* Slot descriptors — mem_handle поле deprecated (передаётся через FDs),
+     * только seq atomic нужен. */
    for (int i = 0; i < pub->ring_size_actual; ++i) {
-        pub->hdr->slots[i].mem_handle = pub->ipc_mem[i];
        atomic_store_explicit(&pub->hdr->slots[i].seq, UINT64_MAX,
                              memory_order_release);
    }
@@ -285,6 +342,7 @@ static int common_init(struct cuframes_publisher *pub,
    pub->next_seq = 0;
    pub->current_slot = -1;
    pub->has_acquired = 0;
+    for (int i = 0; i < CUFRAMES_MAX_RING; i++) pub->vmm_fds[i] = -1;
    pthread_mutex_init(&pub->state_mu, NULL);
    return CUFRAMES_OK;
 }
@@ -300,7 +358,6 @@ int cuframes_publisher_create(const cuframes_publisher_config_t *cfg,
    common_init(pub, cfg);

    if ((r = alloc_library_pool(pub)) != CUFRAMES_OK) goto fail;
-    if ((r = create_event_handle(pub)) != CUFRAMES_OK) goto fail;
    if ((r = setup_shm(pub)) != CUFRAMES_OK) goto fail;
    if ((r = setup_socket(pub)) != CUFRAMES_OK) goto fail;

@@ -312,7 +369,7 @@ int cuframes_publisher_create(const cuframes_publisher_config_t *cfg,
    }
    pub->accept_thread_alive = 1;

-    CUFRAMES_LOG_INFO("publisher '%s' ready (ring=%d, %dx%d, fmt=%d, lib-owned)",
+    CUFRAMES_LOG_INFO("publisher '%s' ready (ring=%d, %dx%d, fmt=%d, lib-owned, v0.4 VMM)",
                       pub->key, pub->ring_size_actual,
                       pub->cfg.width, pub->cfg.height, (int)pub->cfg.format);
    *out = pub;
@@ -328,37 +385,12 @@ int cuframes_publisher_create_external(const cuframes_publisher_config_t *cfg,
                                        int32_t ptr_count,
                                        size_t frame_size,
                                        cuframes_publisher_t **out) {
-    int r = validate_config(cfg);
-    if (r != CUFRAMES_OK) return r;
-    if (cfg->ownership != CUFRAMES_OWNERSHIP_EXTERNAL) return CUFRAMES_ERR_INVALID_ARG;
-    if (!cuda_ptrs || ptr_count < 1) return CUFRAMES_ERR_INVALID_ARG;
-    if (frame_size == 0) return CUFRAMES_ERR_INVALID_ARG;
-
-    struct cuframes_publisher *pub = calloc(1, sizeof(*pub));
-    if (!pub) return CUFRAMES_ERR_OUT_OF_MEMORY;
-    common_init(pub, cfg);
-
-    if ((r = register_external_pool(pub, cuda_ptrs, ptr_count, frame_size)) != CUFRAMES_OK)
-        goto fail;
-    if ((r = create_event_handle(pub)) != CUFRAMES_OK) goto fail;
-    if ((r = setup_shm(pub)) != CUFRAMES_OK) goto fail;
-    if ((r = setup_socket(pub)) != CUFRAMES_OK) goto fail;
-
-    pub->stop_flag = 0;
-    if (pthread_create(&pub->accept_thread, NULL, accept_thread_main, pub) != 0) {
-        r = CUFRAMES_ERR_INTERNAL;
-        goto fail;
-    }
-    pub->accept_thread_alive = 1;
-
-    CUFRAMES_LOG_INFO("publisher '%s' ready (external pool=%d, %dx%d, fmt=%d)",
-                       pub->key, ptr_count,
-                       pub->cfg.width, pub->cfg.height, (int)pub->cfg.format);
-    *out = pub;
-    return CUFRAMES_OK;
-fail:
-    cuframes_publisher_destroy(pub);
-    return r;
+    /* v0.4: external ownership больше не поддерживается. VMM API требует
+     * cuMemCreate-allocated memory; existing cudaMalloc-pointers нельзя
+     * export'нуть как POSIX FD. Use LIBRARY ownership. */
+    (void)cfg; (void)cuda_ptrs; (void)ptr_count; (void)frame_size; (void)out;
+    CUFRAMES_LOG_ERROR("EXTERNAL ownership не поддерживается в v0.4 (VMM-only)");
+    return CUFRAMES_ERR_INVALID_ARG;
 }

 int cuframes_publisher_acquire(cuframes_publisher_t *pub, void **cuda_ptr_out) {
@@ -379,27 +411,24 @@ int cuframes_publisher_acquire(cuframes_publisher_t *pub, void **cuda_ptr_out) {
            while (1) {
                uint64_t ack = atomic_load_explicit(&pub->hdr->slots[slot].ack_bitmap,
                                                     memory_order_acquire);
-                /* Если slot ещё не публикован (seq == UINT64_MAX) — пропустить ack check */
                uint64_t cur_seq = atomic_load_explicit(&pub->hdr->slots[slot].seq,
                                                         memory_order_acquire);
                if (cur_seq == UINT64_MAX || (ack & bitmap) == bitmap) break;
                if (deadline && cuframes_now_ns() > deadline) {
-                    /* Mark slow subscriber dead и continue */
                    uint64_t missing = bitmap & ~ack;
                    CUFRAMES_LOG_WARN("strict-wait timeout, slow subscribers bitmap=0x%lx",
                                       (unsigned long)missing);
-                    /* clear missing subscribers — TODO: send unsubscribe in v0.2 */
                    atomic_fetch_and_explicit(&pub->hdr->subscriber_bitmap,
                                               ~missing, memory_order_release);
                    break;
                }
-                struct timespec ts = {.tv_sec = 0, .tv_nsec = 100000}; /* 100µs */
+                struct timespec ts = {.tv_sec = 0, .tv_nsec = 100000};
                nanosleep(&ts, NULL);
            }
        }
    }

-    *cuda_ptr_out = pub->cuda_ptrs[slot];
+    *cuda_ptr_out = (void *)(uintptr_t)pub->vmm_ptrs[slot];
    pub->current_slot = slot;
    pub->has_acquired = 1;
    return CUFRAMES_OK;
@@ -407,10 +436,14 @@ int cuframes_publisher_acquire(cuframes_publisher_t *pub, void **cuda_ptr_out) {

 static int do_publish(cuframes_publisher_t *pub, int32_t slot,
                       void *stream, int64_t pts_ns) {
-    /* Record event on producer's stream */
-    cudaError_t cerr = cudaEventRecord(pub->event, (cudaStream_t)stream);
+    /* v0.4 — заменяет cudaEventRecord+IPC events на cuStreamSynchronize.
+     * Producer ждёт что stream flush'нулся (~1ms на 5090), потом publishes
+     * seq atomically. Consumer читает данные через DtoD memcpy без event
+     * wait — hardware coherence гарантирована на одном GPU. */
+    cudaError_t cerr = cudaStreamSynchronize((cudaStream_t)stream);
    if (cerr != cudaSuccess) {
-        CUFRAMES_LOG_ERROR("cudaEventRecord: %s", cudaGetErrorString(cerr));
+        CUFRAMES_LOG_ERROR("cudaStreamSynchronize (slot %d): %s",
+                            slot, cudaGetErrorString(cerr));
        return CUFRAMES_ERR_CUDA;
    }

@@ -443,44 +476,8 @@ int cuframes_publisher_publish(cuframes_publisher_t *pub, void *stream, int64_t

 int cuframes_publisher_publish_external(cuframes_publisher_t *pub,
                                         void *cuda_ptr, void *stream, int64_t pts_ns) {
-    if (!pub || !cuda_ptr) return CUFRAMES_ERR_INVALID_ARG;
-    if (pub->cfg.ownership != CUFRAMES_OWNERSHIP_EXTERNAL) return CUFRAMES_ERR_INVALID_ARG;
-
-    int32_t slot = -1;
-    for (int i = 0; i < pub->external_count; ++i) {
-        if (pub->external_ptrs[i] == cuda_ptr) { slot = i; break; }
-    }
-    if (slot < 0) {
-        CUFRAMES_LOG_ERROR("external pointer %p not registered", cuda_ptr);
-        return CUFRAMES_ERR_INVALID_ARG;
-    }
-
-    /* STRICT_WAIT — то же что в acquire, но per-publish */
-    if (pub->cfg.policy == CUFRAMES_POLICY_STRICT_WAIT) {
-        uint64_t bitmap = atomic_load_explicit(&pub->hdr->subscriber_bitmap,
-                                                memory_order_acquire);
-        if (bitmap != 0) {
-            int64_t deadline = pub->cfg.consumer_ack_timeout_ms > 0
-                ? cuframes_now_ns() + (int64_t)pub->cfg.consumer_ack_timeout_ms * 1000000LL
-                : 0;
-            while (1) {
-                uint64_t ack = atomic_load_explicit(&pub->hdr->slots[slot].ack_bitmap,
-                                                     memory_order_acquire);
-                uint64_t cur_seq = atomic_load_explicit(&pub->hdr->slots[slot].seq,
-                                                         memory_order_acquire);
-                if (cur_seq == UINT64_MAX || (ack & bitmap) == bitmap) break;
-                if (deadline && cuframes_now_ns() > deadline) {
-                    uint64_t missing = bitmap & ~ack;
-                    atomic_fetch_and_explicit(&pub->hdr->subscriber_bitmap,
-                                               ~missing, memory_order_release);
-                    break;
-                }
-                struct timespec ts = {.tv_sec = 0, .tv_nsec = 100000};
-                nanosleep(&ts, NULL);
-            }
-        }
-    }
-    return do_publish(pub, slot, stream, pts_ns);
+    (void)pub; (void)cuda_ptr; (void)stream; (void)pts_ns;
+    return CUFRAMES_ERR_INVALID_ARG;  /* v0.4 — нет external mode */
 }

 int cuframes_publisher_destroy(cuframes_publisher_t *pub) {
@@ -502,13 +499,10 @@ int cuframes_publisher_destroy(cuframes_publisher_t *pub) {
        pub->accept_thread_alive = 0;
    }

-    /* Free CUDA */
-    if (pub->cfg.ownership == CUFRAMES_OWNERSHIP_LIBRARY) {
-        for (int i = 0; i < pub->ring_size_actual; ++i) {
-            if (pub->cuda_ptrs[i]) cudaFree(pub->cuda_ptrs[i]);
-        }
+    /* Free VMM */
+    if (pub->has_vmm_pool) {
+        free_vmm_pool(pub);
    }
-    if (pub->event) cudaEventDestroy(pub->event);

    /* Packet ring cleanup (если активирован) */
    if (pub->has_pkt_ring) {
@@ -562,11 +556,7 @@ int cuframes_publisher_enable_packets(cuframes_publisher_t *pub,

    pub->has_pkt_ring = 1;
    pub->max_packet_size = max_pkt;
-
-    /* Bump proto_version в frames header чтобы v2-subscribers видели поддержку. */
-    if (pub->hdr) {
-        pub->hdr->proto_version = CUFRAMES_PROTOCOL_V2;
-    }
+    /* v0.4 frame header proto не bumped из-за packet ring — оба коэкзистируют. */
    return CUFRAMES_OK;
 }

@@ -591,6 +581,30 @@ int cuframes_publisher_publish_packet(cuframes_publisher_t *pub,

 /* ─── Accept thread + handshake ──────────────────────────────────────── */

+struct sub_monitor_args {
+    struct cuframes_publisher *pub;
+    int fd;
+    uint32_t bit;
+};
+
+static void *subscriber_monitor_thread(void *arg) {
+    struct sub_monitor_args *m = (struct sub_monitor_args *)arg;
+    char buf[64];
+    while (1) {
+        ssize_t n = recv(m->fd, buf, sizeof(buf), 0);
+        if (n <= 0) {
+            atomic_fetch_and_explicit(&m->pub->hdr->subscriber_bitmap,
+                ~(1ULL << m->bit), memory_order_release);
+            atomic_store_explicit(&m->pub->hdr->subscribers[m->bit].state, 0,
+                memory_order_release);
+            close(m->fd);
+            CUFRAMES_LOG_INFO("subscriber bit=%u disconnected — freed", m->bit);
+            free(m);
+            return NULL;
+        }
+    }
+}
+
 static void *accept_thread_main(void *arg) {
    struct cuframes_publisher *pub = (struct cuframes_publisher *)arg;
    while (!pub->stop_flag) {
@@ -603,21 +617,16 @@ static void *accept_thread_main(void *arg) {
            CUFRAMES_LOG_WARN("accept: %s", strerror(errno));
            continue;
        }
-        /* Synchronous handshake — после ответа socket остаётся открытым для
-         * lifetime signals (SHUTDOWN, PING). Close на error. */
        int r = handshake_subscriber(pub, client);
        if (r != CUFRAMES_OK) {
            close(client);
        }
-        /* TODO v0.2: track client fds для broadcast SHUTDOWN. Сейчас clients
-         * сами detect socket EOF при publisher_destroy через shutdown(). */
    }
    return NULL;
 }

 static int allocate_subscriber_bit(struct cuframes_publisher *pub,
                                    const char *name, uint32_t *bit_out) {
-    /* Bit 0 reserved (sentinel). Bits 1..31. */
    pthread_mutex_lock(&pub->state_mu);
    for (uint32_t bit = 1; bit < CUFRAMES_MAX_SUBSCRIBERS; ++bit) {
        uint64_t state = atomic_load_explicit(&pub->hdr->subscribers[bit].state,
@@ -637,7 +646,6 @@ static int allocate_subscriber_bit(struct cuframes_publisher *pub,
            pthread_mutex_unlock(&pub->state_mu);
            return CUFRAMES_OK;
        }
-        /* Check for name collision */
        if (name && state >= 2 &&
            strncmp(pub->hdr->subscribers[bit].consumer_name, name,
                    sizeof(pub->hdr->subscribers[bit].consumer_name)) == 0) {
@@ -664,7 +672,6 @@ static int handshake_subscriber(struct cuframes_publisher *pub, int client_fd) {
        return CUFRAMES_ERR_PROTOCOL;
    }

-    /* Parse HELLO_REQ: proto_version + name_len + name + cuda_device + mode */
    if (plen < sizeof(cuframes_msg_hello_req_t) + 20) return CUFRAMES_ERR_PROTOCOL;
    cuframes_msg_hello_req_t *hreq = (cuframes_msg_hello_req_t *)buf;
    uint32_t want_proto = hreq->proto_version;
@@ -674,18 +681,18 @@ static int handshake_subscriber(struct cuframes_publisher *pub, int client_fd) {
    char name[32] = {0};
    memcpy(name, buf + sizeof(*hreq), name_len);

-    int proto_match = (want_proto == CUFRAMES_PROTOCOL_V1);
+    /* v0.4 принимает только V4 consumers. Старые v0.3 fail здесь cleanly. */
+    int proto_match = (want_proto == CUFRAMES_PROTOCOL_V4);

    /* Send HELLO_RESP */
    uint8_t resp_buf[CUFRAMES_MAX_MSG_PAYLOAD];
    cuframes_msg_hello_resp_t *resp = (cuframes_msg_hello_resp_t *)resp_buf;
    memset(resp, 0, sizeof(*resp));
    resp->result = proto_match ? CUFRAMES_OK : CUFRAMES_ERR_PROTOCOL;
-    resp->proto_version_actual = CUFRAMES_PROTOCOL_V1;
+    resp->proto_version_actual = CUFRAMES_PROTOCOL_V4;
    resp->ring_size = (uint32_t)pub->ring_size_actual;
    resp->ownership_mode = (uint32_t)pub->cfg.ownership;
    resp->meta = pub->hdr->meta;
-    /* shm_path */
    int slen = snprintf((char *)(resp_buf + sizeof(*resp)),
                         sizeof(resp_buf) - sizeof(*resp) - 12,
                         "%s", pub->shm_name);
@@ -698,7 +705,11 @@ static int handshake_subscriber(struct cuframes_publisher *pub, int client_fd) {
        CUFRAMES_LOG_WARN("send HELLO_RESP: %s", cuframes_strerror(r));
        return r;
    }
-    if (!proto_match) return CUFRAMES_ERR_PROTOCOL;
+    if (!proto_match) {
+        CUFRAMES_LOG_WARN("subscriber proto v%u rejected (want v%u)",
+                          want_proto, CUFRAMES_PROTOCOL_V4);
+        return CUFRAMES_ERR_PROTOCOL;
+    }

    /* recv SUBSCRIBE_REQ */
    plen = sizeof(buf);
@@ -706,11 +717,9 @@ static int handshake_subscriber(struct cuframes_publisher *pub, int client_fd) {
    if (r != CUFRAMES_OK) return r;
    if (mtype != CUFRAMES_MSG_SUBSCRIBE_REQ) return CUFRAMES_ERR_PROTOCOL;

-    /* Allocate subscriber bit */
    uint32_t bit = 0;
    int alloc_r = allocate_subscriber_bit(pub, name, &bit);

-    /* Send SUBSCRIBE_RESP */
    cuframes_msg_subscribe_resp_t sresp = {0};
    sresp.result = alloc_r;
    sresp.assigned_bit = bit;
@@ -721,13 +730,42 @@ static int handshake_subscriber(struct cuframes_publisher *pub, int client_fd) {
                                     &sresp, sizeof(sresp));
    if (r != CUFRAMES_OK || alloc_r != CUFRAMES_OK) return r ? r : alloc_r;

-    /* Activate subscriber slot */
+    /* v0.4 — отправить VMM_FDS с N posix FDs через SCM_RIGHTS */
+    cuframes_msg_vmm_fds_t vmm_payload = {0};
+    vmm_payload.slot_size_bytes = pub->vmm_slot_size;
+    vmm_payload.fd_count = (uint32_t)pub->ring_size_actual;
+    r = cuframes_internal_send_msg_with_fds(client_fd, CUFRAMES_MSG_VMM_FDS,
+                                             &vmm_payload, sizeof(vmm_payload),
+                                             pub->vmm_fds,
+                                             (uint32_t)pub->ring_size_actual);
+    if (r != CUFRAMES_OK) {
+        CUFRAMES_LOG_WARN("send VMM_FDS: %s", cuframes_strerror(r));
+        /* roll back bit allocation */
+        atomic_fetch_and_explicit(&pub->hdr->subscriber_bitmap,
+            ~(1ULL << bit), memory_order_release);
+        atomic_store_explicit(&pub->hdr->subscribers[bit].state, 0,
+            memory_order_release);
+        return r;
+    }
+
    atomic_store_explicit(&pub->hdr->subscribers[bit].state, 2,
                           memory_order_release);

-    CUFRAMES_LOG_INFO("subscriber '%s' connected (bit=%u)", name, bit);
+    CUFRAMES_LOG_INFO("subscriber '%s' connected (bit=%u, %d VMM FDs)",
+                      name, bit, pub->ring_size_actual);

-    /* TODO v0.2: spawn per-client thread для liveness/PING/UNSUBSCRIBE.
-     * Сейчас socket остаётся открытым на heap'е до publisher_destroy. */
+    /* Spawn monitor thread */
+    struct sub_monitor_args *m = malloc(sizeof(*m));
+    if (!m) return CUFRAMES_OK;
+    m->pub = pub;
+    m->fd = client_fd;
+    m->bit = bit;
+    pthread_t monitor_tid;
+    if (pthread_create(&monitor_tid, NULL, subscriber_monitor_thread, m) != 0) {
+        CUFRAMES_LOG_WARN("monitor pthread_create fail — bit %u may leak", bit);
+        free(m);
+    } else {
+        pthread_detach(monitor_tid);
+    }
    return CUFRAMES_OK;
 }
@@ -3,7 +3,9 @@
 #include "internal.h"
 #include <errno.h>
 #include <poll.h>
+#include <string.h>
 #include <sys/socket.h>
+#include <sys/uio.h>
 #include <unistd.h>

 /* Read exactly N bytes from socket, with poll-based timeout. */
@@ -97,3 +99,121 @@ int cuframes_internal_recv_msg(int fd, uint32_t *msg_type_out,
    if (payload_len_inout) *payload_len_inout = h.payload_length;
    return CUFRAMES_OK;
 }
+
+/* v0.4 — send TLV msg + N FDs через SCM_RIGHTS. Один sendmsg(): header+payload
+ * в iovec, FDs в control. Header.payload_length описывает ТОЛЬКО payload bytes,
+ * FDs приходят out-of-band. */
+int cuframes_internal_send_msg_with_fds(int sock_fd, uint32_t msg_type,
+                                         const void *payload, uint32_t payload_len,
+                                         const int *fds, uint32_t fd_count) {
+    if (payload_len > CUFRAMES_MAX_MSG_PAYLOAD) return CUFRAMES_ERR_INVALID_ARG;
+    if (fd_count > 0 && !fds) return CUFRAMES_ERR_INVALID_ARG;
+
+    cuframes_msg_header_t h = {.msg_type = msg_type, .payload_length = payload_len};
+
+    struct iovec iov[2];
+    iov[0].iov_base = &h;             iov[0].iov_len = sizeof(h);
+    iov[1].iov_base = (void *)payload; iov[1].iov_len = payload_len;
+
+    struct msghdr msg = {0};
+    msg.msg_iov = iov;
+    msg.msg_iovlen = (payload_len > 0 && payload) ? 2 : 1;
+
+    char ctrl_buf[CMSG_SPACE(sizeof(int) * 64)] = {0};
+    if (fd_count > 0) {
+        if (fd_count > 64) return CUFRAMES_ERR_INVALID_ARG;
+        msg.msg_control = ctrl_buf;
+        msg.msg_controllen = CMSG_SPACE(sizeof(int) * fd_count);
+        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+        cmsg->cmsg_level = SOL_SOCKET;
+        cmsg->cmsg_type = SCM_RIGHTS;
+        cmsg->cmsg_len = CMSG_LEN(sizeof(int) * fd_count);
+        memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * fd_count);
+    }
+
+    ssize_t n = sendmsg(sock_fd, &msg, MSG_NOSIGNAL);
+    if (n < 0) {
+        if (errno == EPIPE) return CUFRAMES_ERR_DISCONNECTED;
+        return CUFRAMES_ERR_IO;
+    }
+    /* Partial send rare для small payload — но обработаем gracefully */
+    size_t want = sizeof(h) + payload_len;
+    if ((size_t)n < want) {
+        return send_all(sock_fd, (uint8_t *)iov[0].iov_base + n,
+                        want - (size_t)n);
+    }
+    return CUFRAMES_OK;
+}
+
+int cuframes_internal_recv_msg_with_fds(int sock_fd, uint32_t *msg_type_out,
+                                         void *payload, uint32_t *payload_len_inout,
+                                         int *fds_out, uint32_t *fd_count_inout,
+                                         int32_t timeout_ms) {
+    /* Poll первым делом — recvmsg блокирующий, иначе тайм-аут не сработает. */
+    if (timeout_ms >= 0) {
+        struct pollfd pfd = {.fd = sock_fd, .events = POLLIN};
+        int pr = poll(&pfd, 1, timeout_ms);
+        if (pr == 0) return CUFRAMES_ERR_TIMEOUT;
+        if (pr < 0) return CUFRAMES_ERR_IO;
+    }
+
+    cuframes_msg_header_t h;
+    struct iovec iov[2];
+    iov[0].iov_base = &h;     iov[0].iov_len = sizeof(h);
+    iov[1].iov_base = payload; iov[1].iov_len = (payload && payload_len_inout) ? *payload_len_inout : 0;
+
+    uint32_t want_fds = (fd_count_inout && fds_out) ? *fd_count_inout : 0;
+    char ctrl_buf[CMSG_SPACE(sizeof(int) * 64)] = {0};
+    struct msghdr msg = {0};
+    msg.msg_iov = iov;
+    msg.msg_iovlen = (iov[1].iov_len > 0) ? 2 : 1;
+    msg.msg_control = ctrl_buf;
+    msg.msg_controllen = sizeof(ctrl_buf);
+
+    ssize_t n = recvmsg(sock_fd, &msg, 0);
+    if (n == 0) return CUFRAMES_ERR_DISCONNECTED;
+    if (n < 0) return CUFRAMES_ERR_IO;
+    if ((size_t)n < sizeof(h)) return CUFRAMES_ERR_PROTOCOL;
+
+    if (msg_type_out) *msg_type_out = h.msg_type;
+    if (h.payload_length > CUFRAMES_MAX_MSG_PAYLOAD) return CUFRAMES_ERR_PROTOCOL;
+
+    /* Если recvmsg вернул меньше payload_length — добираем через recv_all */
+    size_t got_payload = (size_t)n - sizeof(h);
+    if (h.payload_length > 0) {
+        if (!payload || !payload_len_inout || *payload_len_inout < h.payload_length) {
+            return CUFRAMES_ERR_INVALID_ARG;
+        }
+        if (got_payload < h.payload_length) {
+            int r = recv_all(sock_fd, (uint8_t *)payload + got_payload,
+                              h.payload_length - got_payload, timeout_ms);
+            if (r != CUFRAMES_OK) return r;
+        }
+        *payload_len_inout = h.payload_length;
+    } else if (payload_len_inout) {
+        *payload_len_inout = 0;
+    }
+
+    /* Parse SCM_RIGHTS FDs */
+    uint32_t got_fds = 0;
+    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+    for (; cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+        if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
+            size_t blob = cmsg->cmsg_len - CMSG_LEN(0);
+            uint32_t n_fds = (uint32_t)(blob / sizeof(int));
+            if (got_fds + n_fds > want_fds) {
+                /* Close excess FDs чтобы не утекли */
+                for (uint32_t i = 0; i < n_fds; i++) {
+                    int f;
+                    memcpy(&f, CMSG_DATA(cmsg) + i * sizeof(int), sizeof(int));
+                    close(f);
+                }
+                continue;
+            }
+            memcpy(fds_out + got_fds, CMSG_DATA(cmsg), blob);
+            got_fds += n_fds;
+        }
+    }
+    if (fd_count_inout) *fd_count_inout = got_fds;
+    return CUFRAMES_OK;
+}
@@ -0,0 +1,7 @@
+build/
+dist/
+*.egg-info/
+__pycache__/
+*.pyc
+*.so
+.pytest_cache/
@@ -0,0 +1,52 @@
+# Python bindings for cuframes — pybind11 module.
+#
+# Buildup: используется как subdirectory из root CMakeLists.txt при
+# BUILD_PYTHON_BINDINGS=ON, либо standalone через scikit-build-core
+# (см. pyproject.toml).
+#
+# Output: единый shared module `_native.so` который импортируется из
+# Python package `cuframes` (cuframes/__init__.py re-export'ит публичный API).
+
+include(FetchContent)
+
+# pybind11 — header-only + helper functions. FetchContent чтобы не требовать
+# system install; pinned tag для воспроизводимых билдов.
+FetchContent_Declare(
+    pybind11
+    GIT_REPOSITORY https://github.com/pybind/pybind11.git
+    GIT_TAG v2.13.6
+    GIT_SHALLOW TRUE
+)
+FetchContent_MakeAvailable(pybind11)
+
+pybind11_add_module(_native MODULE
+    src/_native.cpp
+)
+
+target_include_directories(_native PRIVATE
+    ${PROJECT_SOURCE_DIR}/include
+)
+
+target_link_libraries(_native PRIVATE
+    cuframes  # imported target из libcuframes/CMakeLists.txt
+)
+
+# Версия модуля соответствует libcuframes (см. cuframes.h)
+target_compile_definitions(_native PRIVATE
+    CUFRAMES_PY_BINDING_VERSION="${PROJECT_VERSION}"
+)
+
+set_target_properties(_native PROPERTIES
+    CXX_STANDARD 17
+    CXX_STANDARD_REQUIRED ON
+    CXX_VISIBILITY_PRESET hidden
+    INTERPROCEDURAL_OPTIMIZATION TRUE
+)
+
+# При scikit-build-core билде модуль попадает в wheel рядом с Python-исходниками
+# пакета. При standalone CMake — устанавливается в site-packages по умолчанию.
+if(SKBUILD)
+    install(TARGETS _native DESTINATION cuframes)
+else()
+    install(TARGETS _native LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}/cuframes)
+endif()
@@ -0,0 +1,53 @@
+# cuframes — Python bindings
+
+Status: **WIP** (Phase 0 skeleton — issue [gx/cuframes#6](http://server:3000/gx/cuframes/issues/6))
+
+Это пакет Python-обёрток над `libcuframes` (C ABI). Цель — позволить
+downstream ML/CV пайплайнам (yolo-world-detector, zone-motion, custom
+скриптам) подписываться на cuframes без CPU round-trip: получать NV12
+frames прямо как CUDA pointer / `torch.Tensor` (DLPack export, zero-copy).
+
+## Текущий статус (что уже работает в этом skeleton)
+
+- Module import: `import cuframes` загружает `_native.so`
+- Версия: `cuframes.version_string()`, `cuframes.protocol_version()`
+- Enums: `PixelFormat`, `SubscriberMode`
+- Иерархия исключений: `CuframesError` + 8 subclasses (publisher gone,
+  frame timeout, device lost, и т. д.)
+
+## Что в работе (см. tasks #198-#202)
+
+- [ ] `CuframesSubscriber` + `CuframesFrame` lifecycle
+- [ ] DLPack export → `torch.from_dlpack`, `cupy.from_dlpack`
+- [ ] Context manager (`with cuframes.subscribe(key) as sub:`)
+- [ ] Per-subscriber CUDA stream
+- [ ] Health/stats properties (`ring_occupancy`, `drop_count`)
+- [ ] Thread-safety contract документация
+
+## Build (dev)
+
+Standalone wheel:
+
+```bash
+cd python/
+pip install -e . --no-build-isolation
+```
+
+Через корневой CMake-проект (вместе с libcuframes):
+
+```bash
+cmake -B build -DBUILD_PYTHON_BINDINGS=ON
+cmake --build build -j
+```
+
+## Зависимости
+
+- `libcuframes` ≥ 0.4 (линкуется из соседнего CMake target)
+- CUDA Toolkit 12+
+- `pybind11` 2.13+ (берётся через FetchContent при CMake-сборке)
+- Python 3.10+
+- Опционально: `torch>=2.4` или `cupy-cuda12x>=13` для DLPack-потребителей
+
+## Лицензия
+
+LGPL-2.1+ (как у libcuframes).
@@ -0,0 +1,77 @@
+"""cuframes — zero-copy CUDA frame sharing.
+
+Python bindings to libcuframes. См. docs/python.md (т.б.д.) для
+архитектуры, threading контракта и примеров интеграции с PyTorch/CuPy.
+
+Пример (subscriber-side):
+
+    import cuframes
+
+    with cuframes.subscribe("cam-parking",
+                            consumer_name="yolo-world",
+                            connect_timeout_ms=5000) as sub:
+        # next_frame returns CuframesFrame — context manager
+        with sub.next_frame(timeout_ms=1000) as frame:
+            print(frame.cuda_ptr, frame.width, frame.height,
+                  frame.pitch_y, frame.seq, frame.pts_ns)
+            # DLPack export — в task #199, пока через cuda-python:
+            # cuda_arr = cuda.from_pointer(frame.cuda_ptr, ...)
+
+Reconnect-loop пример:
+
+    while True:
+        try:
+            with cuframes.subscribe("cam-parking", connect_timeout_ms=5000) as sub:
+                while True:
+                    try:
+                        with sub.next_frame(timeout_ms=1000) as frame:
+                            process(frame)
+                    except cuframes.CuframesFrameTimeout:
+                        continue  # просто нет новых кадров
+        except cuframes.CuframesPublisherGone:
+            time.sleep(1)  # publisher restart — переподписываемся
+"""
+
+from ._native import (
+    # Метаданные
+    version_string,
+    protocol_version,
+    # Enums
+    PixelFormat,
+    SubscriberMode,
+    # Core API
+    CuframesSubscriber,
+    CuframesFrame,
+    subscribe,
+    # Error taxonomy
+    CuframesError,
+    CuframesPublisherGone,
+    CuframesFrameTimeout,
+    CuframesDeviceLost,
+    CuframesShmError,
+    CuframesProtocolMismatch,
+    CuframesInvalidArgument,
+    CuframesOutOfMemory,
+    CuframesInternal,
+)
+
+__version__ = version_string()
+
+__all__ = [
+    "version_string",
+    "protocol_version",
+    "PixelFormat",
+    "SubscriberMode",
+    "CuframesSubscriber",
+    "CuframesFrame",
+    "subscribe",
+    "CuframesError",
+    "CuframesPublisherGone",
+    "CuframesFrameTimeout",
+    "CuframesDeviceLost",
+    "CuframesShmError",
+    "CuframesProtocolMismatch",
+    "CuframesInvalidArgument",
+    "CuframesOutOfMemory",
+    "CuframesInternal",
+]
@@ -0,0 +1,47 @@
+[build-system]
+requires = [
+    "scikit-build-core>=0.10",
+    "pybind11>=2.13",
+]
+build-backend = "scikit_build_core.build"
+
+[project]
+name = "cuframes"
+version = "0.4.0"
+description = "Python bindings for cuframes — zero-copy CUDA frame sharing"
+readme = "README.md"
+license = { text = "LGPL-2.1+" }
+requires-python = ">=3.10"
+authors = [{ name = "Evgeny Demchenko", email = "demchenkoev@gmail.com" }]
+keywords = ["cuda", "video", "ipc", "zero-copy"]
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: GNU Lesser General Public License v2 or later (LGPLv2+)",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Topic :: Multimedia :: Video",
+]
+
+[project.optional-dependencies]
+torch = ["torch>=2.4"]
+cupy = ["cupy-cuda12x>=13"]
+dev = ["pytest>=8", "ruff>=0.6"]
+
+[tool.scikit-build]
+cmake.version = ">=3.20"
+cmake.build-type = "Release"
+build-dir = "build/{wheel_tag}"
+wheel.packages = ["cuframes"]
+# Будем строить только Python модуль; libcuframes собирается отдельно
+# в основном CMake-проекте и линкуется как imported target.
+cmake.args = ["-DBUILD_PYTHON_BINDINGS=ON", "-DBUILD_EXAMPLES=OFF", "-DBUILD_TOOLS=OFF"]
+cmake.source-dir = ".."
+
+[tool.scikit-build.cmake.define]
+BUILD_PYTHON_BINDINGS = "ON"
+
+[tool.pytest.ini_options]
+testpaths = ["tests"]
@@ -0,0 +1,757 @@
+// cuframes Python bindings — pybind11 entry point.
+//
+// Этот файл реализует core wrapper для subscriber-side API:
+//   - CuframesFrame   — owning handle одного frame'а, context manager
+//   - CuframesSubscriber — owning handle subscription'а, context manager
+//
+// DLPack export (#199), per-subscriber CUDA stream (#201), health/stats props
+// (#200) — добавляются в последующих коммитах в этот же файл.
+//
+// Контракт thread-safety (предварительный, финальный — task #202):
+//   - Каждый handle (CuframesSubscriber / CuframesFrame) принадлежит одному
+//     Python потоку. Cross-thread access = undefined behavior на C-уровне.
+//   - GIL отпускается на длинных I/O вызовах (next_frame) — другие Python
+//     потоки могут работать пока мы ждём frame.
+
+#include <pybind11/pybind11.h>
+#include <pybind11/stl.h>
+#include <cstring>
+#include <optional>
+#include <stdexcept>
+#include <string>
+
+#include "cuframes/cuframes.h"
+
+// DLPack — стандартный protocol для exchange tensor-like структур между
+// фреймворками (PyTorch/CuPy/JAX/TF). См. https://dmlc.github.io/dlpack/latest/
+// Мы embedим header inline чтобы не добавлять external dep — header
+// небольшой и стабильный (DLPack 1.0+).
+namespace dlpack {
+
+typedef enum {
+    kDLCPU = 1,
+    kDLCUDA = 2,
+} DLDeviceType;
+
+typedef struct {
+    DLDeviceType device_type;
+    int32_t device_id;
+} DLDevice;
+
+typedef enum {
+    kDLInt = 0U,
+    kDLUInt = 1U,
+    kDLFloat = 2U,
+} DLDataTypeCode;
+
+typedef struct {
+    uint8_t code;
+    uint8_t bits;
+    uint16_t lanes;
+} DLDataType;
+
+typedef struct {
+    void* data;
+    DLDevice device;
+    int32_t ndim;
+    DLDataType dtype;
+    int64_t* shape;
+    int64_t* strides;
+    uint64_t byte_offset;
+} DLTensor;
+
+typedef struct DLManagedTensor {
+    DLTensor dl_tensor;
+    void* manager_ctx;
+    void (*deleter)(struct DLManagedTensor* self);
+} DLManagedTensor;
+
+}  // namespace dlpack
+
+namespace py = pybind11;
+
+namespace {
+
+// ─────────────────────────────────────────────────────────────────────────────
+// Error taxonomy — Python exceptions, соответствующие cuframes_error_t.
+//
+// Принцип: каждая категория ошибок которая требует разной обработки в
+// downstream'е (reconnect vs retry vs fatal) → отдельный exception class.
+// Это решает требование из architect review: «detector должен уметь
+// reconnect-loop по publisher-gone, не падать».
+// ─────────────────────────────────────────────────────────────────────────────
+
+struct CuframesExceptions {
+    py::object base;
+    py::object publisher_gone;     // CUFRAMES_ERR_DISCONNECTED, _NOT_FOUND
+    py::object frame_timeout;      // CUFRAMES_ERR_TIMEOUT, _WOULD_BLOCK
+    py::object device_lost;        // CUFRAMES_ERR_CUDA
+    py::object shm_error;          // CUFRAMES_ERR_IO
+    py::object protocol_mismatch;  // CUFRAMES_ERR_PROTOCOL
+    py::object invalid_argument;   // CUFRAMES_ERR_INVALID_ARG
+    py::object out_of_memory;      // CUFRAMES_ERR_OUT_OF_MEMORY
+    py::object internal;           // CUFRAMES_ERR_INTERNAL, прочее
+};
+
+CuframesExceptions g_exc;
+
+// Маппинг cuframes_error_t → подходящий Python exception class.
+py::object exception_for(int err) {
+    switch (err) {
+        case CUFRAMES_ERR_NOT_FOUND:
+        case CUFRAMES_ERR_DISCONNECTED:
+            return g_exc.publisher_gone;
+        case CUFRAMES_ERR_TIMEOUT:
+        case CUFRAMES_ERR_WOULD_BLOCK:
+            return g_exc.frame_timeout;
+        case CUFRAMES_ERR_CUDA:
+            return g_exc.device_lost;
+        case CUFRAMES_ERR_IO:
+            return g_exc.shm_error;
+        case CUFRAMES_ERR_PROTOCOL:
+            return g_exc.protocol_mismatch;
+        case CUFRAMES_ERR_INVALID_ARG:
+            return g_exc.invalid_argument;
+        case CUFRAMES_ERR_OUT_OF_MEMORY:
+            return g_exc.out_of_memory;
+        default:
+            return g_exc.internal;
+    }
+}
+
+// Бросает подходящий exception если err != CUFRAMES_OK.
+void check(int err, const char* operation = nullptr) {
+    if (err == CUFRAMES_OK) return;
+    const char* msg = cuframes_strerror(err);
+    std::string what = operation
+        ? std::string(operation) + ": " + msg + " (code=" + std::to_string(err) + ")"
+        : std::string(msg) + " (code=" + std::to_string(err) + ")";
+    PyErr_SetString(exception_for(err).ptr(), what.c_str());
+    throw py::error_already_set();
+}
+
+// ─────────────────────────────────────────────────────────────────────────────
+// CuframesFrame — owning wrapper над cuframes_frame_t*.
+//
+// Lifecycle:
+//   - конструируется через Subscriber::next_frame() (single source of truth)
+//   - в destructor'е (или __exit__) автоматически вызывает release
+//   - после release() все property accessor'ы бросают CuframesError
+//   - non-copyable, non-movable из Python (PyObject identity)
+//
+// Frame держит **слабую** ссылку (raw pointer) на subscriber. Если subscriber
+// уничтожен раньше frame'а — released() становится no-op (subscriber разрулит
+// освобождение всех outstanding frames при cuframes_subscriber_destroy).
+// Чтобы избежать use-after-free, frame проверяет sub_alive_ через shared_ptr.
+//
+// Для простоты Phase 0 — frame и subscriber должны жить в одном Python потоке,
+// порядок destruction под управлением Python GC. Refcount на Python-стороне
+// от субскриптора держится через py::object атрибут.
+// ─────────────────────────────────────────────────────────────────────────────
+
+class FrameWrapper {
+public:
+    FrameWrapper(cuframes_subscriber_t* sub, cuframes_frame_t* frame)
+        : sub_(sub), frame_(frame) {}
+
+    ~FrameWrapper() {
+        try { release(); } catch (...) { /* destructor — глотаем */ }
+    }
+
+    // pybind11 не любит copyable wrappers для owning ресурсов.
+    FrameWrapper(const FrameWrapper&) = delete;
+    FrameWrapper& operator=(const FrameWrapper&) = delete;
+
+    bool released() const noexcept { return frame_ == nullptr; }
+
+    void release() {
+        if (frame_ != nullptr) {
+            // sub_ может быть nullptr если subscriber разорвал связь раньше —
+            // в этом случае release уже не нужен (subscriber всё освободил).
+            if (sub_ != nullptr) {
+                cuframes_subscriber_release(sub_, frame_);
+            }
+            frame_ = nullptr;
+        }
+    }
+
+    // Internal hook — subscriber говорит frame'у «я умираю, не release()ай».
+    void invalidate_subscriber() noexcept { sub_ = nullptr; }
+
+    // ── Properties ──────────────────────────────────────────────────────
+    // Все геттеры проверяют released() — иначе CuframesError.
+
+    void check_alive() const {
+        if (frame_ == nullptr) {
+            PyErr_SetString(g_exc.base.ptr(), "frame has been released");
+            throw py::error_already_set();
+        }
+    }
+
+    uintptr_t cuda_ptr() const {
+        check_alive();
+        return reinterpret_cast<uintptr_t>(cuframes_frame_cuda_ptr(frame_));
+    }
+
+    cuframes_format_t format() const {
+        check_alive();
+        return cuframes_frame_format(frame_);
+    }
+
+    int width() const {
+        check_alive();
+        int32_t w, h;
+        cuframes_frame_size(frame_, &w, &h);
+        return w;
+    }
+
+    int height() const {
+        check_alive();
+        int32_t w, h;
+        cuframes_frame_size(frame_, &w, &h);
+        return h;
+    }
+
+    int pitch_y() const {
+        check_alive();
+        return cuframes_frame_pitch_y(frame_);
+    }
+
+    int pitch_uv() const {
+        check_alive();
+        return cuframes_frame_pitch_uv(frame_);
+    }
+
+    uint64_t seq() const {
+        check_alive();
+        return cuframes_frame_seq(frame_);
+    }
+
+    int64_t pts_ns() const {
+        check_alive();
+        return cuframes_frame_pts_ns(frame_);
+    }
+
+    cuframes_subscriber_t* internal_sub() const noexcept { return sub_; }
+    cuframes_frame_t* internal_frame() const noexcept { return frame_; }
+
+private:
+    cuframes_subscriber_t* sub_;
+    cuframes_frame_t* frame_;
+};
+
+// ─────────────────────────────────────────────────────────────────────────────
+// DLPack export helpers.
+//
+// Кадр в NV12 состоит из 2 plane'ов: Y (uint8, H×W, pitch=pitch_y) и
+// UV interleaved (uint8, H/2×W, pitch=pitch_uv; W здесь = ширина в байтах
+// для interleaved U+V).
+//
+// Стратегия: даём пользователю 2 отдельных DLPack capsule на каждый plane.
+// Это стандартный pattern в PyTorch/CuPy (torchcodec, cuda-python).
+// UV offset вычисляется из pitch_y * height_aligned (NVDEC выравнивает
+// height до aligned значения — обычно высота уже aligned, но мы используем
+// видимую height из frame_size).
+//
+// Lifetime: deleter capsule освобождает только shape/strides arrays.
+// Сам CUDA pointer принадлежит frame'у — gone-frame должно быть released
+// **после** того как DLPack capsule destroyed. Чтобы не дать пользователю
+// shoot in foot, capsule.manager_ctx держит py::object на FrameWrapper
+// (увеличивает refcount), которое освобождается в deleter.
+// ─────────────────────────────────────────────────────────────────────────────
+
+struct DLPackContext {
+    py::object frame_keep_alive;  // CuframesFrame Python-side
+    std::vector<int64_t> shape;
+    std::vector<int64_t> strides;
+};
+
+static void dlpack_deleter(dlpack::DLManagedTensor* self) {
+    if (!self) return;
+    auto* ctx = static_cast<DLPackContext*>(self->manager_ctx);
+    if (ctx) {
+        // Releasing Python refcount требует GIL
+        py::gil_scoped_acquire gil;
+        delete ctx;
+    }
+    delete self;
+}
+
+static void dlpack_pycapsule_destructor(PyObject* capsule) {
+    if (PyCapsule_IsValid(capsule, "dltensor")) {
+        // Capsule НЕ был consumed downstream'ом (e.g. torch.from_dlpack).
+        // Нужно освободить managed tensor самим.
+        auto* mt = static_cast<dlpack::DLManagedTensor*>(
+            PyCapsule_GetPointer(capsule, "dltensor"));
+        if (mt && mt->deleter) {
+            mt->deleter(mt);
+        }
+    }
+    // Если PyCapsule имеет name "used_dltensor" — downstream взял ownership,
+    // мы ничего не делаем.
+}
+
+static py::capsule make_dlpack_capsule(
+    void* data,
+    int rows, int cols, int64_t row_stride_bytes,
+    int cuda_device,
+    py::object frame_keep_alive
+) {
+    auto* ctx = new DLPackContext;
+    ctx->frame_keep_alive = std::move(frame_keep_alive);
+    ctx->shape = {static_cast<int64_t>(rows), static_cast<int64_t>(cols)};
+    ctx->strides = {row_stride_bytes, 1};
+
+    auto* mt = new dlpack::DLManagedTensor;
+    mt->dl_tensor.data = data;
+    mt->dl_tensor.device = {dlpack::kDLCUDA, cuda_device};
+    mt->dl_tensor.ndim = 2;
+    mt->dl_tensor.dtype = {dlpack::kDLUInt, 8, 1};  // uint8
+    mt->dl_tensor.shape = ctx->shape.data();
+    mt->dl_tensor.strides = ctx->strides.data();
+    mt->dl_tensor.byte_offset = 0;
+    mt->manager_ctx = ctx;
+    mt->deleter = dlpack_deleter;
+
+    return py::capsule(mt, "dltensor", &dlpack_pycapsule_destructor);
+}
+
+// ─────────────────────────────────────────────────────────────────────────────
+// CuframesSubscriber — owning wrapper над cuframes_subscriber_t*.
+//
+// API:
+//   sub = cuframes.subscribe("cam-parking", consumer_name="yolo-world",
+//                            timeout_ms=5000)
+//   with sub:
+//       with sub.next_frame(timeout_ms=1000) as frame:
+//           do_something(frame.cuda_ptr, frame.width, frame.height)
+//   # sub.close() здесь автоматически
+//
+// Iteration (Phase 0.5):
+//   for frame in sub.frames(timeout_ms=1000):
+//       ...
+// ─────────────────────────────────────────────────────────────────────────────
+
+// Per-subscriber health stats. Phase 0 версия — counted в pybind layer
+// (cuframes C API не expose'ит ring_occupancy / drop_count напрямую).
+// Если в будущем cuframes расширит C API (cuframes_subscriber_get_stats),
+// добавим reads оттуда — но текущие counters остаются для совместимости
+// с тем что consumer'у видно через Python API.
+struct SubscriberStats {
+    uint64_t frames_received = 0;  // успешных next_frame()
+    uint64_t timeouts = 0;         // CUFRAMES_ERR_TIMEOUT / WOULD_BLOCK
+    uint64_t errors = 0;           // прочие fail'ы в next_frame()
+    uint64_t last_seq = 0;         // seq последнего полученного frame'а
+    uint64_t gap_count = 0;        // сколько раз seq[i] > seq[i-1] + 1
+                                   // (proxy для drop count в NEWEST_ONLY mode)
+    int64_t  last_frame_pts_ns = 0;
+};
+
+class SubscriberWrapper {
+public:
+    SubscriberWrapper(
+        const std::string& key,
+        std::optional<std::string> consumer_name,
+        cuframes_subscriber_mode_t mode,
+        int cuda_device,
+        int connect_timeout_ms,
+        uintptr_t consumer_stream
+    ) : key_(key),
+        consumer_name_(consumer_name.value_or("")),
+        mode_(mode),
+        cuda_device_(cuda_device),
+        consumer_stream_(reinterpret_cast<void*>(consumer_stream)) {
+
+        cuframes_subscriber_config_t cfg = {};
+        cfg.key = key_.c_str();
+        cfg.consumer_name = consumer_name.has_value() ? consumer_name_.c_str() : nullptr;
+        cfg.mode = mode_;
+        cfg.cuda_device = cuda_device_;
+        cfg.connect_timeout_ms = connect_timeout_ms;
+
+        // create — может быть блокирующим (ждёт publisher'а). GIL release.
+        int err;
+        {
+            py::gil_scoped_release rel;
+            err = cuframes_subscriber_create(&cfg, &sub_);
+        }
+        check(err, "cuframes_subscriber_create");
+    }
+
+    ~SubscriberWrapper() {
+        try { close(); } catch (...) { /* destructor — глотаем */ }
+    }
+
+    SubscriberWrapper(const SubscriberWrapper&) = delete;
+    SubscriberWrapper& operator=(const SubscriberWrapper&) = delete;
+
+    bool closed() const noexcept { return sub_ == nullptr; }
+
+    void close() {
+        if (sub_ != nullptr) {
+            cuframes_subscriber_destroy(sub_);
+            sub_ = nullptr;
+        }
+    }
+
+    void check_alive() const {
+        if (sub_ == nullptr) {
+            PyErr_SetString(g_exc.base.ptr(), "subscriber has been closed");
+            throw py::error_already_set();
+        }
+    }
+
+    // Возвращает new FrameWrapper. Caller владеет через Python GC.
+    // GIL release на время блокирующего вызова — другие потоки работают.
+    std::unique_ptr<FrameWrapper> next_frame(int timeout_ms) {
+        check_alive();
+        cuframes_frame_t* raw = nullptr;
+        int err;
+        {
+            py::gil_scoped_release rel;
+            // Используем persistent per-subscriber stream — все consumer'ы
+            // получают независимый cudaStreamWaitEvent, не серializуются
+            // через default stream.
+            err = cuframes_subscriber_next(sub_, consumer_stream_,
+                                            &raw, timeout_ms);
+        }
+        // Update health stats до check() — иначе при exception они не
+        // увеличатся, и оператору будет непонятно почему counters застыли.
+        if (err == CUFRAMES_OK) {
+            stats_.frames_received++;
+            uint64_t seq = cuframes_frame_seq(raw);
+            if (stats_.last_seq != 0 && seq > stats_.last_seq + 1) {
+                stats_.gap_count++;
+            }
+            stats_.last_seq = seq;
+            stats_.last_frame_pts_ns = cuframes_frame_pts_ns(raw);
+        } else if (err == CUFRAMES_ERR_TIMEOUT || err == CUFRAMES_ERR_WOULD_BLOCK) {
+            stats_.timeouts++;
+        } else {
+            stats_.errors++;
+        }
+        check(err, "cuframes_subscriber_next");
+        return std::make_unique<FrameWrapper>(sub_, raw);
+    }
+
+    const std::string& key() const { return key_; }
+    const std::string& consumer_name() const { return consumer_name_; }
+    cuframes_subscriber_mode_t mode() const { return mode_; }
+    int cuda_device() const { return cuda_device_; }
+    const SubscriberStats& stats() const { return stats_; }
+
+    // Snapshot stats как Python dict — для MQTT health publish.
+    py::dict stats_dict() const {
+        py::dict d;
+        d["frames_received"]   = stats_.frames_received;
+        d["timeouts"]          = stats_.timeouts;
+        d["errors"]            = stats_.errors;
+        d["last_seq"]          = stats_.last_seq;
+        d["gap_count"]         = stats_.gap_count;
+        d["last_frame_pts_ns"] = stats_.last_frame_pts_ns;
+        return d;
+    }
+
+    uintptr_t consumer_stream() const {
+        return reinterpret_cast<uintptr_t>(consumer_stream_);
+    }
+
+private:
+    cuframes_subscriber_t* sub_ = nullptr;
+    std::string key_;
+    std::string consumer_name_;
+    cuframes_subscriber_mode_t mode_;
+    int cuda_device_;
+    // CUDA stream — opaque cudaStream_t. Передаётся снаружи как int
+    // (полученный через cuda-python / torch.cuda.Stream._as_parameter_).
+    // nullptr = default stream (только для smoke-тестов; в продакшене
+    // консумерам надо иметь свой stream чтобы избежать serialization
+    // через default).
+    void* consumer_stream_ = nullptr;
+    SubscriberStats stats_{};
+};
+
+}  // namespace
+
+PYBIND11_MODULE(_native, m) {
+    m.doc() = "cuframes — zero-copy CUDA frame sharing (native bindings)";
+
+    // ── Версия ──────────────────────────────────────────────────────────
+    m.def("version_string", []() {
+        return std::string(cuframes_version_string());
+    }, "Runtime version of libcuframes (MAJOR.MINOR.PATCH).");
+
+    m.def("protocol_version", []() {
+        return static_cast<uint32_t>(cuframes_protocol_version());
+    }, "Wire-protocol version. Subscribers с разной версией не подключатся.");
+
+    m.attr("__binding_version__") = CUFRAMES_PY_BINDING_VERSION;
+
+    // ── Error taxonomy ──────────────────────────────────────────────────
+    // Иерархия:
+    //   CuframesError (base)
+    //   ├── CuframesPublisherGone
+    //   ├── CuframesFrameTimeout
+    //   ├── CuframesDeviceLost
+    //   ├── CuframesShmError
+    //   ├── CuframesProtocolMismatch
+    //   ├── CuframesInvalidArgument
+    //   ├── CuframesOutOfMemory
+    //   └── CuframesInternal
+    //
+    // py::exception<T>(...) уже возвращает py::object на сам Python class.
+    // Не вызываем .attr("__class__") — иначе получим metaclass.
+
+    g_exc.base = py::exception<std::runtime_error>(m, "CuframesError");
+    auto make_subexc = [&m](const char* name) -> py::object {
+        return py::exception<std::runtime_error>(m, name, g_exc.base.ptr());
+    };
+    g_exc.publisher_gone     = make_subexc("CuframesPublisherGone");
+    g_exc.frame_timeout      = make_subexc("CuframesFrameTimeout");
+    g_exc.device_lost        = make_subexc("CuframesDeviceLost");
+    g_exc.shm_error          = make_subexc("CuframesShmError");
+    g_exc.protocol_mismatch  = make_subexc("CuframesProtocolMismatch");
+    g_exc.invalid_argument   = make_subexc("CuframesInvalidArgument");
+    g_exc.out_of_memory      = make_subexc("CuframesOutOfMemory");
+    g_exc.internal           = make_subexc("CuframesInternal");
+
+    // ── Pixel formats (enum mirror) ─────────────────────────────────────
+    py::enum_<cuframes_format_t>(m, "PixelFormat")
+        .value("NV12",      CUFRAMES_FORMAT_NV12)
+        .value("YUV420P",   CUFRAMES_FORMAT_YUV420P)
+        .value("RGB",       CUFRAMES_FORMAT_RGB)
+        .value("BGR",       CUFRAMES_FORMAT_BGR)
+        .value("RGBA",      CUFRAMES_FORMAT_RGBA)
+        .value("GRAYSCALE", CUFRAMES_FORMAT_GRAYSCALE);
+
+    py::enum_<cuframes_subscriber_mode_t>(m, "SubscriberMode")
+        .value("NEWEST_ONLY",  CUFRAMES_MODE_NEWEST_ONLY)
+        .value("STRICT_ORDER", CUFRAMES_MODE_STRICT_ORDER);
+
+    // ── CuframesFrame ───────────────────────────────────────────────────
+    py::class_<FrameWrapper>(m, "CuframesFrame",
+        "Один кадр от cuframes publisher'а.\n\n"
+        "Получается через CuframesSubscriber.next_frame().\n"
+        "Поддерживает context manager — release() при выходе из with-блока.\n"
+        "Все property accessor'ы после release() бросают CuframesError.\n\n"
+        "Это handle на frame в ring buffer publisher'а — данные остаются\n"
+        "в shared memory publisher'а пока frame не released. Долго удерживать\n"
+        "frame нельзя: medленный consumer заставит publisher либо overwrite\n"
+        "(DROP_OLDEST policy), либо stall (STRICT_WAIT).")
+        // properties (read-only)
+        .def_property_readonly("cuda_ptr", &FrameWrapper::cuda_ptr,
+            "CUDA device pointer на frame data (uintptr_t). Read-only для\n"
+            "consumer'а. Используйте через cuda-python / cupy / torch.from_blob.")
+        .def_property_readonly("format", &FrameWrapper::format,
+            "PixelFormat (NV12 для NVDEC publisher'а).")
+        .def_property_readonly("width", &FrameWrapper::width)
+        .def_property_readonly("height", &FrameWrapper::height)
+        .def_property_readonly("pitch_y", &FrameWrapper::pitch_y,
+            "Pitch (байт на строку) для Y plane. ВАЖНО: для больших\n"
+            "разрешений (2688×1520, gate_lpr) pitch != width — kernel'ы\n"
+            "должны принимать pitch как параметр.")
+        .def_property_readonly("pitch_uv", &FrameWrapper::pitch_uv,
+            "Pitch для UV plane (NV12/YUV420P); 0 для форматов без UV.")
+        .def_property_readonly("seq", &FrameWrapper::seq,
+            "Sequence number — монотонная нумерация у publisher'а.")
+        .def_property_readonly("pts_ns", &FrameWrapper::pts_ns,
+            "Presentation timestamp от publisher'а (наносекунды, CLOCK_MONOTONIC).")
+        .def_property_readonly("released", &FrameWrapper::released)
+        .def("release", &FrameWrapper::release,
+            "Освободить frame обратно publisher'у (ACK).\n"
+            "После release() property accessor'ы бросают CuframesError.\n"
+            "Idempotent — повторный вызов no-op.")
+        // context manager
+        .def("__enter__", [](FrameWrapper& self) -> FrameWrapper& {
+            self.check_alive();
+            return self;
+        }, py::return_value_policy::reference_internal)
+        .def("__exit__", [](FrameWrapper& self, py::object, py::object, py::object) {
+            self.release();
+            return py::none();
+        })
+        .def("__repr__", [](const FrameWrapper& f) {
+            if (f.released()) return std::string("<CuframesFrame released>");
+            return std::string("<CuframesFrame seq=") + std::to_string(f.seq()) +
+                   " size=" + std::to_string(f.width()) + "x" + std::to_string(f.height()) + ">";
+        })
+        // ── DLPack export ───────────────────────────────────────────────
+        // Multi-plane formats (NV12, YUV420P) — экспортируем планы отдельно
+        // как 2D uint8 tensors. Consumer строит логику склейки сам.
+        // Для single-plane (RGB/BGR/RGBA/GRAYSCALE) — __dlpack__() работает.
+        .def("dlpack_y",
+            [](py::object self) -> py::capsule {
+                auto& f = self.cast<FrameWrapper&>();
+                f.check_alive();
+                void* ptr = cuframes_frame_cuda_ptr(f.internal_frame());
+                int32_t w, h;
+                cuframes_frame_size(f.internal_frame(), &w, &h);
+                int pitch = cuframes_frame_pitch_y(f.internal_frame());
+                // Для NV12/YUV420P width = ширина в пикселях, Y занимает W байт/строка.
+                // Pitch (физическая строка в памяти) может быть > W. Передаём как stride.
+                // cuda_device извлекаем не из frame (нет API) — фиксируем 0 для default;
+                // task #201 добавит per-subscriber stream и реальный device.
+                return make_dlpack_capsule(ptr, h, w, pitch, /*cuda_device=*/0, self);
+            },
+            "DLPack export Y-plane как 2D uint8 GPU tensor (shape=[H, W], stride=[pitch_y, 1]).\n"
+            "Работает для NV12, YUV420P, GRAYSCALE. Для других форматов — отдаёт первый plane.")
+        .def("dlpack_uv",
+            [](py::object self) -> py::capsule {
+                auto& f = self.cast<FrameWrapper&>();
+                f.check_alive();
+                auto fmt = cuframes_frame_format(f.internal_frame());
+                if (fmt != CUFRAMES_FORMAT_NV12) {
+                    PyErr_SetString(g_exc.invalid_argument.ptr(),
+                        "dlpack_uv() only supported for NV12 format");
+                    throw py::error_already_set();
+                }
+                void* base = cuframes_frame_cuda_ptr(f.internal_frame());
+                int32_t w, h;
+                cuframes_frame_size(f.internal_frame(), &w, &h);
+                int pitch_y = cuframes_frame_pitch_y(f.internal_frame());
+                int pitch_uv = cuframes_frame_pitch_uv(f.internal_frame());
+                // NV12 layout: Y plane занимает pitch_y * h bytes,
+                // UV plane (interleaved U+V) следует сразу за ним.
+                void* uv_ptr = static_cast<uint8_t*>(base) + (size_t)pitch_y * h;
+                // UV plane размеры: H/2 строк, W колонок (interleaved U+V байты).
+                return make_dlpack_capsule(uv_ptr, h / 2, w, pitch_uv, /*cuda_device=*/0, self);
+            },
+            "DLPack export UV-plane (interleaved) для NV12.\n"
+            "Shape=[H/2, W] uint8, stride=[pitch_uv, 1]. U и V interleaved\n"
+            "по байтам в последнем измерении (W = ширина в пикселях, но\n"
+            "каждый pixel = 2 байта U+V).")
+        .def("__dlpack__",
+            [](py::object self, py::object /*stream*/) -> py::capsule {
+                // PEP 3118 / DLPack protocol — single-plane access.
+                // Для NV12/YUV420P возвращает Y plane (это самый частый use
+                // case — motion detection / brightness работают только с Y).
+                // Если нужен UV — явно через .dlpack_uv().
+                auto& f = self.cast<FrameWrapper&>();
+                f.check_alive();
+                void* ptr = cuframes_frame_cuda_ptr(f.internal_frame());
+                int32_t w, h;
+                cuframes_frame_size(f.internal_frame(), &w, &h);
+                int pitch = cuframes_frame_pitch_y(f.internal_frame());
+                return make_dlpack_capsule(ptr, h, w, pitch, /*cuda_device=*/0, self);
+            },
+            py::arg("stream") = py::none(),
+            "DLPack protocol для torch.from_dlpack / cupy.from_dlpack.\n"
+            "Для NV12 возвращает Y plane. Для других planes — .dlpack_uv().")
+        .def("__dlpack_device__",
+            [](const FrameWrapper& f) -> py::tuple {
+                f.check_alive();
+                // (device_type, device_id) — kDLCUDA=2, device 0 (task #201).
+                return py::make_tuple(2, 0);
+            },
+            "DLPack device protocol — возвращает (kDLCUDA=2, device_id).");
+
+    // ── CuframesSubscriber ──────────────────────────────────────────────
+    py::class_<SubscriberWrapper>(m, "CuframesSubscriber",
+        "Subscription на cuframes publisher.\n\n"
+        "Создаётся через cuframes.subscribe(key, ...). Поддерживает context\n"
+        "manager — close() при выходе из with-блока.\n\n"
+        "Thread-safety contract:\n"
+        "  • Handle принадлежит одному Python потоку — создание и\n"
+        "    все вызовы (next_frame, close) должны быть в одном thread.\n"
+        "  • Несколько subscriber'ов в разных потоках — OK (каждому свой\n"
+        "    handle, свой CUDA stream).\n"
+        "  • Доступ к Frame после release() из другого потока — UB\n"
+        "    (cuframes_frame_t* указывает в ring buffer publisher'а, после\n"
+        "    release он может быть переписан).\n"
+        "  • Внутренний GIL отпускается на длинных I/O вызовах\n"
+        "    (subscriber_create, next_frame) — другие Python потоки могут\n"
+        "    выполняться параллельно пока мы ждём frame.\n\n"
+        "CUDA stream:\n"
+        "  consumer_stream передаётся как int (cudaStream_t как opaque\n"
+        "  pointer). Получается через cuda-python (cudart.cudaStreamCreate)\n"
+        "  или torch (torch.cuda.Stream()._as_parameter_). Если 0 —\n"
+        "  default stream (serialization risk при нескольких subscriber'ах\n"
+        "  в одном процессе).")
+        .def(py::init<const std::string&, std::optional<std::string>,
+                       cuframes_subscriber_mode_t, int, int, uintptr_t>(),
+             py::arg("key"),
+             py::arg("consumer_name") = py::none(),
+             py::arg("mode") = CUFRAMES_MODE_NEWEST_ONLY,
+             py::arg("cuda_device") = 0,
+             py::arg("connect_timeout_ms") = -1,
+             py::arg("consumer_stream") = 0,
+             "Создать subscription. Блокирует до publisher_ready или\n"
+             "connect_timeout_ms. -1 = ждать вечно, 0 = fail сразу.\n"
+             "consumer_stream: int representation cudaStream_t (0=default).")
+        .def_property_readonly("key", &SubscriberWrapper::key)
+        .def_property_readonly("consumer_name", &SubscriberWrapper::consumer_name)
+        .def_property_readonly("mode", &SubscriberWrapper::mode)
+        .def_property_readonly("cuda_device", &SubscriberWrapper::cuda_device)
+        .def_property_readonly("consumer_stream", &SubscriberWrapper::consumer_stream,
+            "Pointer на cudaStream_t (int). 0 = default stream.")
+        .def_property_readonly("closed", &SubscriberWrapper::closed)
+        .def("next_frame", &SubscriberWrapper::next_frame,
+             py::arg("timeout_ms") = -1,
+             "Получить следующий frame.\n\n"
+             "timeout_ms: -1 = ждать вечно; 0 = non-blocking\n"
+             "(CuframesFrameTimeout если нет данных); >0 = с таймаутом.\n\n"
+             "Возвращает CuframesFrame — context manager. Использовать через\n"
+             "`with sub.next_frame() as frame: ...` для гарантии release.")
+        .def("close", &SubscriberWrapper::close,
+             "Закрыть subscription. Idempotent.")
+        // ── Health / stats ──────────────────────────────────────────────
+        // Phase 0: counted в pybind layer (cuframes C API не expose'ит
+        // ring_occupancy / drop_count напрямую). Эти counters достаточно
+        // для MQTT health publisher / monitoring.
+        .def_property_readonly("frames_received",
+            [](const SubscriberWrapper& s) { return s.stats().frames_received; },
+            "Количество успешных next_frame() с момента subscribe.")
+        .def_property_readonly("timeouts",
+            [](const SubscriberWrapper& s) { return s.stats().timeouts; },
+            "Сколько раз next_frame() вернул CuframesFrameTimeout.")
+        .def_property_readonly("errors",
+            [](const SubscriberWrapper& s) { return s.stats().errors; },
+            "Сколько раз next_frame() упал с error (не timeout).")
+        .def_property_readonly("last_seq",
+            [](const SubscriberWrapper& s) { return s.stats().last_seq; },
+            "Sequence number последнего полученного frame'а.")
+        .def_property_readonly("gap_count",
+            [](const SubscriberWrapper& s) { return s.stats().gap_count; },
+            "Сколько раз seq[i] > seq[i-1] + 1 — proxy для drop count\n"
+            "в NEWEST_ONLY mode. В STRICT_ORDER должен оставаться 0.")
+        .def_property_readonly("last_frame_pts_ns",
+            [](const SubscriberWrapper& s) { return s.stats().last_frame_pts_ns; })
+        .def("stats",
+            [](const SubscriberWrapper& s) { return s.stats_dict(); },
+            "Snapshot всех health counters как dict — для MQTT health publish.")
+        // context manager
+        .def("__enter__", [](SubscriberWrapper& self) -> SubscriberWrapper& {
+            self.check_alive();
+            return self;
+        }, py::return_value_policy::reference_internal)
+        .def("__exit__", [](SubscriberWrapper& self, py::object, py::object, py::object) {
+            self.close();
+            return py::none();
+        })
+        .def("__repr__", [](const SubscriberWrapper& s) {
+            return std::string("<CuframesSubscriber key='") + s.key() +
+                   "' closed=" + (s.closed() ? "True" : "False") + ">";
+        });
+
+    // ── Module-level factory ────────────────────────────────────────────
+    // Удобный shortcut: cuframes.subscribe("cam-parking") вместо
+    // cuframes._native.CuframesSubscriber(...).
+    m.def("subscribe",
+        [](const std::string& key,
+           std::optional<std::string> consumer_name,
+           cuframes_subscriber_mode_t mode,
+           int cuda_device,
+           int connect_timeout_ms,
+           uintptr_t consumer_stream) {
+            return std::make_unique<SubscriberWrapper>(
+                key, consumer_name, mode, cuda_device,
+                connect_timeout_ms, consumer_stream);
+        },
+        py::arg("key"),
+        py::arg("consumer_name") = py::none(),
+        py::arg("mode") = CUFRAMES_MODE_NEWEST_ONLY,
+        py::arg("cuda_device") = 0,
+        py::arg("connect_timeout_ms") = -1,
+        py::arg("consumer_stream") = 0,
+        "Создать CuframesSubscriber. Shortcut для CuframesSubscriber(...).");
+}
@@ -0,0 +1,112 @@
+"""Smoke tests для cuframes Python bindings.
+
+В Phase 0 (skeleton) проверяем что:
+  - модуль импортируется
+  - версия читается
+  - error классы существуют и являются нормальной иерархией
+
+Subscriber / DLPack тесты появятся в следующих фазах
+(см. issue gx/cuframes#6, tasks #198+).
+"""
+
+import cuframes
+
+
+def test_version_format():
+    v = cuframes.version_string()
+    assert isinstance(v, str)
+    parts = v.split(".")
+    assert len(parts) >= 3
+    assert all(p.isdigit() for p in parts[:3])
+
+
+def test_protocol_version_is_uint():
+    pv = cuframes.protocol_version()
+    assert isinstance(pv, int)
+    assert pv >= 0
+
+
+def test_pixel_format_enum_members():
+    assert cuframes.PixelFormat.NV12.value == 0
+    assert cuframes.PixelFormat.YUV420P.value == 1
+
+
+def test_subscriber_mode_enum_members():
+    assert cuframes.SubscriberMode.NEWEST_ONLY.value == 0
+    assert cuframes.SubscriberMode.STRICT_ORDER.value == 1
+
+
+def test_error_hierarchy():
+    """Все subtype'ы наследуются от CuframesError."""
+    for sub in [
+        cuframes.CuframesPublisherGone,
+        cuframes.CuframesFrameTimeout,
+        cuframes.CuframesDeviceLost,
+        cuframes.CuframesShmError,
+        cuframes.CuframesProtocolMismatch,
+        cuframes.CuframesInvalidArgument,
+        cuframes.CuframesOutOfMemory,
+        cuframes.CuframesInternal,
+    ]:
+        assert issubclass(sub, cuframes.CuframesError)
+
+
+def test_subscriber_class_exposed():
+    """CuframesSubscriber/CuframesFrame exposed как public classes."""
+    assert hasattr(cuframes, "CuframesSubscriber")
+    assert hasattr(cuframes, "CuframesFrame")
+    assert hasattr(cuframes, "subscribe")
+
+
+def test_subscribe_to_missing_publisher_raises():
+    """Subscribe к несуществующему publisher → CuframesError (subclass)
+    после connect_timeout_ms.
+
+    Этот тест работает на любом хосте (без живого cuframes-pub) — мы
+    верифицируем что error path работает и маппит CUFRAMES_ERR_*
+    в правильный Python exception.
+    """
+    import pytest
+    with pytest.raises(cuframes.CuframesError):
+        cuframes.subscribe(
+            "definitely-not-existing-publisher-xyz",
+            connect_timeout_ms=100,
+        )
+
+
+def test_subscriber_repr_when_unable_to_connect():
+    """Лёгкий тест что repr не падает и close idempotent."""
+    import pytest
+    try:
+        sub = cuframes.subscribe("nope-xyz", connect_timeout_ms=100)
+    except cuframes.CuframesError:
+        return  # ожидаемо
+    pytest.fail("subscribe должно было выкинуть exception")
+
+
+def test_subscribe_accepts_consumer_stream_param():
+    """consumer_stream — uintptr (cudaStream_t).
+
+    Проверяем что параметр accepted; реальное использование требует
+    cuda-python / torch.cuda.Stream — это в integration тестах
+    yolo-world-detector'а.
+    """
+    import pytest
+    with pytest.raises(cuframes.CuframesError):
+        cuframes.subscribe(
+            "nope-xyz",
+            connect_timeout_ms=100,
+            consumer_stream=0,  # 0 = default stream
+        )
+
+
+def test_subscribe_kwargs_signature():
+    """Проверяем что у subscribe правильный набор kwargs."""
+    import inspect
+    # Pybind11-обёртки не дают inspect.signature, но help_doc отражает их.
+    doc = cuframes.subscribe.__doc__
+    assert "consumer_name" in doc
+    assert "mode" in doc
+    assert "cuda_device" in doc
+    assert "connect_timeout_ms" in doc
+    assert "consumer_stream" in doc
@@ -0,0 +1,4 @@
+vmm_fd_pingpong/producer
+vmm_fd_pingpong/consumer
+smoke_v04/smoke_pub
+smoke_v04/smoke_sub
@@ -0,0 +1,13 @@
+CFLAGS  = -O2 -Wall -I../../include -I/usr/local/cuda/include
+LDFLAGS = -L../../build-v04/libcuframes -lcuframes -L/usr/local/cuda/lib64 -lcudart -lcuda -lpthread -lrt
+
+all: smoke_pub smoke_sub
+
+smoke_pub: smoke_pub.c
+	gcc $(CFLAGS) -o $@ $< $(LDFLAGS)
+
+smoke_sub: smoke_sub.c
+	gcc $(CFLAGS) -o $@ $< $(LDFLAGS)
+
+clean:
+	rm -f smoke_pub smoke_sub
@@ -0,0 +1,55 @@
+/* v0.4 smoke test publisher — NV12 1920x1080 ring 4, fill каждый slot
+ * с pattern (i % 256), publish, infinite loop. */
+#include <cuframes/cuframes.h>
+#include <cuda_runtime.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+#include <unistd.h>
+
+int main(int argc, char **argv) {
+    const char *key = argc > 1 ? argv[1] : "smoke";
+
+    cuframes_publisher_config_t cfg = {0};
+    cfg.key = key;
+    cfg.width = 1920;
+    cfg.height = 1080;
+    cfg.format = CUFRAMES_FORMAT_NV12;
+    cfg.ownership = CUFRAMES_OWNERSHIP_LIBRARY;
+    cfg.ring_size = 4;
+    cfg.policy = CUFRAMES_POLICY_DROP_OLDEST;
+    cfg.cuda_device = 0;
+
+    cuframes_publisher_t *pub = NULL;
+    int r = cuframes_publisher_create(&cfg, &pub);
+    if (r != CUFRAMES_OK) {
+        fprintf(stderr, "publisher create failed: %d (%s)\n", r, cuframes_strerror(r));
+        return 1;
+    }
+    fprintf(stderr, "publisher 'cuframes-%s' ready (v0.4 VMM)\n", key);
+
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+
+    int i = 0;
+    while (1) {
+        void *ptr = NULL;
+        r = cuframes_publisher_acquire(pub, &ptr);
+        if (r != CUFRAMES_OK) { fprintf(stderr, "acquire: %d\n", r); break; }
+
+        uint8_t pattern = (uint8_t)(i & 0xFF);
+        cudaMemsetAsync(ptr, pattern, 1920 * 1080 * 3 / 2, stream);
+
+        r = cuframes_publisher_publish(pub, stream,
+                                        (int64_t)cuframes_now_ns());
+        if (r != CUFRAMES_OK) { fprintf(stderr, "publish: %d\n", r); break; }
+        i++;
+        if (i % 50 == 0) fprintf(stderr, "published %d frames\n", i);
+        struct timespec ts = {.tv_sec = 0, .tv_nsec = 40000000};  /* 25 fps */
+        nanosleep(&ts, NULL);
+    }
+
+    cudaStreamDestroy(stream);
+    cuframes_publisher_destroy(pub);
+    return 0;
+}
@@ -0,0 +1,63 @@
+/* v0.4 smoke subscriber — connect, read 100 frames, verify pattern, exit 0/1. */
+#include <cuframes/cuframes.h>
+#include <cuda_runtime.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+int main(int argc, char **argv) {
+    const char *key = argc > 1 ? argv[1] : "smoke";
+
+    cuframes_subscriber_config_t cfg = {0};
+    cfg.key = key;
+    cfg.consumer_name = "smoke-sub";
+    cfg.mode = CUFRAMES_MODE_NEWEST_ONLY;
+    cfg.cuda_device = 0;
+    cfg.connect_timeout_ms = 10000;
+
+    cuframes_subscriber_t *sub = NULL;
+    int r = cuframes_subscriber_create(&cfg, &sub);
+    if (r != CUFRAMES_OK) {
+        fprintf(stderr, "subscriber create failed: %d (%s)\n", r, cuframes_strerror(r));
+        return 1;
+    }
+    fprintf(stderr, "subscribed to '%s' (v0.4)\n", key);
+
+    cudaStream_t stream;
+    cudaStreamCreate(&stream);
+    size_t check_size = 1024;  /* sample 1KB чтобы не тратить время */
+    uint8_t *host = malloc(check_size);
+
+    int frames = 0;
+    int good = 0;
+    while (frames < 100) {
+        cuframes_frame_t *f = NULL;
+        r = cuframes_subscriber_next(sub, stream, &f, 2000);
+        if (r != CUFRAMES_OK) {
+            fprintf(stderr, "next failed: %d (%s)\n", r, cuframes_strerror(r));
+            break;
+        }
+        cudaMemcpyAsync(host, cuframes_frame_cuda_ptr(f), check_size,
+                         cudaMemcpyDeviceToHost, stream);
+        cudaStreamSynchronize(stream);
+        uint8_t exp = host[0];
+        int mismatch = 0;
+        for (size_t i = 1; i < check_size; i++) {
+            if (host[i] != exp) { mismatch++; }
+        }
+        if (mismatch == 0) good++;
+        if (frames % 20 == 0) {
+            fprintf(stderr, "frame seq=%lu byte0=0x%02x mismatch=%d\n",
+                    (unsigned long)cuframes_frame_seq(f), exp, mismatch);
+        }
+        cuframes_subscriber_release(sub, f);
+        frames++;
+    }
+    free(host);
+    cudaStreamDestroy(stream);
+    cuframes_subscriber_destroy(sub);
+
+    fprintf(stderr, "DONE: %d/%d frames OK\n", good, frames);
+    return (good == frames && frames > 0) ? 0 : 1;
+}
@@ -0,0 +1,16 @@
+CC      = gcc
+CFLAGS  = -O2 -Wall -I/usr/local/cuda/include
+LDFLAGS = -L/usr/local/cuda/lib64 -lcuda
+
+all: producer consumer
+
+producer: producer.c common.h
+	$(CC) $(CFLAGS) -o $@ producer.c $(LDFLAGS)
+
+consumer: consumer.c common.h
+	$(CC) $(CFLAGS) -o $@ consumer.c $(LDFLAGS)
+
+clean:
+	rm -f producer consumer
+
+.PHONY: all clean
@@ -0,0 +1,69 @@
+# vmm_fd_pingpong — spike для cuframes v0.4
+
+Проверка: можно ли заменить CUDA IPC mem handles на VMM (cuMemCreate)
+ POSIX FD export, чтобы убрать требование shared pid/ipc namespaces
+между producer и consumer контейнерами.
+
+## Результат: ✅ работает
+
+Запуск 2 контейнеров без shared pid/ipc, только volume mount для
+unix-сокета:
+
+```
+producer: granularity=2097152
+producer: alloc size=16777216
+producer: exported fd=37 for handle
+producer: listening on /run/spike/pingpong.sock, awaiting consumer...
+
+consumer: connected to producer
+consumer: recv fd=38 size=16777216 magic=0xa7
+consumer: imported handle OK
+consumer: mapped + access OK
+consumer: verify mismatch=0/1048576  → ACK=O
+consumer: done (OK)
+```
+
+## Ключевые наблюдения
+
+- **Granularity на 5090 = 2 MB**. 1920×1080 NV12 (~3.1 MB) округлится до 4 MB.
+  16 slots × 4 камеры × +1 MB = +64 MB VRAM поверх текущих cuda IPC аллокаций.
+- **FD передаётся через `sendmsg(SCM_RIGHTS)`** — kernel прокидывает реальный FD
+  в receiver namespace, переименовывая в свободный номер. Volume mount unix
+  socket'а — единственное требование (`/run/cuframes` уже монтируется как shared).
+- **`cuMemImportFromShareableHandle`** принимает FD как `(void *)(uintptr_t)fd`.
+- **Доступ на consumer side требует `cuMemSetAccess` с правильным `CUmemLocation`** —
+  device id из своего `cuDeviceGet`, не наследуется от producer.
+
+## Замена events (упрощение этапа C)
+
+CUDA events для IPC не имеют POSIX FD path. Внедрять external semaphores
+(OPAQUE_FD) — отдельный API, другая sigal/wait семантика. **Вместо этого:**
+producer вызывает `cuStreamSynchronize(stream)` ПЕРЕД `atomic_store(seq)` в
+`do_publish`. Consumer тогда просто читает seq и копирует DtoD — без event wait.
+
+Overhead: ~1 ms на publish × 25 fps = 2.5% CPU time producer'а. Memory
+coherence гарантирована (один GPU, hardware ensures writes visible после
+stream sync).
+
+## Сборка
+
+```bash
+docker run --rm -v $PWD:/work -w /work nvidia/cuda:12.4.1-devel-ubuntu22.04 \
+    bash -c "apt-get install -y build-essential && make"
+```
+
+## Запуск теста
+
+```bash
+sudo mkdir -p /var/run/spike-pingpong && sudo chmod 777 /var/run/spike-pingpong
+
+docker run -d --name spike-prod --runtime=nvidia --gpus all \
+    -v $PWD:/work -v /var/run/spike-pingpong:/run/spike \
+    nvidia/cuda:12.4.1-base-ubuntu22.04 /work/producer
+
+docker run --rm --name spike-cons --runtime=nvidia --gpus all \
+    -v $PWD:/work -v /var/run/spike-pingpong:/run/spike \
+    nvidia/cuda:12.4.1-base-ubuntu22.04 /work/consumer
+
+docker logs spike-prod && docker rm -f spike-prod
+```
@@ -0,0 +1,20 @@
+#pragma once
+#include <cuda.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+
+#define POOL_SIZE (16 * 1024 * 1024)
+#define MAGIC_BYTE 0xA7
+#define SOCK_PATH "/run/spike/pingpong.sock"
+
+#define CHECK(expr) do {                                                       \
+    CUresult _r = (expr);                                                      \
+    if (_r != CUDA_SUCCESS) {                                                  \
+        const char *_msg = NULL;                                               \
+        cuGetErrorString(_r, &_msg);                                           \
+        fprintf(stderr, "%s:%d %s -> %d (%s)\n",                               \
+                __FILE__, __LINE__, #expr, (int)_r, _msg ? _msg : "?");        \
+        exit(1);                                                               \
+    }                                                                          \
+} while (0)
@@ -0,0 +1,97 @@
+#include "common.h"
+#include <errno.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+
+static int recv_fd(int sock, int *out_fd, uint64_t *out_size, uint8_t *out_magic) {
+    struct msghdr msg = {0};
+    char ctrl[CMSG_SPACE(sizeof(int))];
+    struct iovec iov[2];
+    iov[0].iov_base = out_size; iov[0].iov_len = sizeof(*out_size);
+    iov[1].iov_base = out_magic; iov[1].iov_len = sizeof(*out_magic);
+    msg.msg_iov = iov; msg.msg_iovlen = 2;
+    msg.msg_control = ctrl; msg.msg_controllen = sizeof(ctrl);
+    ssize_t n = recvmsg(sock, &msg, 0);
+    if (n < 0) { perror("recvmsg"); return -1; }
+    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS) {
+        fprintf(stderr, "no SCM_RIGHTS in msg\n");
+        return -1;
+    }
+    memcpy(out_fd, CMSG_DATA(cmsg), sizeof(int));
+    return 0;
+}
+
+int main(void) {
+    CHECK(cuInit(0));
+    CUdevice dev;
+    CHECK(cuDeviceGet(&dev, 0));
+    CUcontext ctx;
+    CHECK(cuCtxCreate(&ctx, 0, dev));
+
+    /* Connect to producer */
+    int sock = socket(AF_UNIX, SOCK_STREAM, 0);
+    if (sock < 0) { perror("socket"); return 1; }
+    struct sockaddr_un sa = {.sun_family = AF_UNIX};
+    strncpy(sa.sun_path, SOCK_PATH, sizeof(sa.sun_path) - 1);
+
+    for (int retry = 0; retry < 50; retry++) {
+        if (connect(sock, (struct sockaddr *)&sa, sizeof(sa)) == 0) break;
+        if (retry == 49) { perror("connect (final)"); return 1; }
+        usleep(100000);
+    }
+    fprintf(stderr, "consumer: connected to producer\n");
+
+    int fd = -1;
+    uint64_t size = 0;
+    uint8_t magic = 0;
+    if (recv_fd(sock, &fd, &size, &magic) < 0) return 1;
+    fprintf(stderr, "consumer: recv fd=%d size=%llu magic=0x%02x\n",
+            fd, (unsigned long long)size, magic);
+
+    CUmemGenericAllocationHandle mem;
+    CHECK(cuMemImportFromShareableHandle(&mem, (void *)(uintptr_t)fd,
+              CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR));
+    fprintf(stderr, "consumer: imported handle OK\n");
+
+    CUdeviceptr ptr;
+    CHECK(cuMemAddressReserve(&ptr, size, 0, 0, 0));
+    CHECK(cuMemMap(ptr, size, 0, mem, 0));
+
+    CUmemAccessDesc access = {0};
+    access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
+    access.location.id = dev;
+    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READ;
+    CHECK(cuMemSetAccess(ptr, size, &access, 1));
+    fprintf(stderr, "consumer: mapped + access OK\n");
+
+    /* Copy out 1MB чтобы убедиться что pattern там */
+    size_t check = size < (1 << 20) ? size : (1 << 20);
+    uint8_t *host = malloc(check);
+    CHECK(cuMemcpyDtoH(host, ptr, check));
+    CHECK(cuCtxSynchronize());
+
+    size_t mismatch = 0;
+    for (size_t i = 0; i < check; i++) {
+        if (host[i] != magic) mismatch++;
+    }
+    free(host);
+
+    char ack = (mismatch == 0) ? 'O' : 'X';
+    fprintf(stderr, "consumer: verify mismatch=%zu/%zu  → ACK=%c\n",
+            mismatch, check, ack);
+
+    write(sock, &ack, 1);
+    close(sock);
+    close(fd);
+
+    CHECK(cuMemUnmap(ptr, size));
+    CHECK(cuMemAddressFree(ptr, size));
+    CHECK(cuMemRelease(mem));
+    CHECK(cuCtxDestroy(ctx));
+
+    fprintf(stderr, "consumer: done (%s)\n", ack == 'O' ? "OK" : "FAIL");
+    return ack == 'O' ? 0 : 1;
+}
@@ -0,0 +1,103 @@
+#include "common.h"
+#include <errno.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+
+/* Send fd через SCM_RIGHTS вместе с (uint64_t size, uint8_t magic) payload. */
+static int send_fd(int sock, int fd, uint64_t size, uint8_t magic) {
+    struct msghdr msg = {0};
+    char ctrl[CMSG_SPACE(sizeof(int))];
+    struct iovec iov[2];
+    iov[0].iov_base = &size; iov[0].iov_len = sizeof(size);
+    iov[1].iov_base = &magic; iov[1].iov_len = sizeof(magic);
+    msg.msg_iov = iov; msg.msg_iovlen = 2;
+    msg.msg_control = ctrl; msg.msg_controllen = sizeof(ctrl);
+    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+    cmsg->cmsg_level = SOL_SOCKET;
+    cmsg->cmsg_type = SCM_RIGHTS;
+    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
+    ssize_t n = sendmsg(sock, &msg, 0);
+    if (n < 0) { perror("sendmsg"); return -1; }
+    return 0;
+}
+
+int main(void) {
+    CHECK(cuInit(0));
+    CUdevice dev;
+    CHECK(cuDeviceGet(&dev, 0));
+    CUcontext ctx;
+    CHECK(cuCtxCreate(&ctx, 0, dev));
+
+    CUmemAllocationProp prop = {0};
+    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
+    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
+    prop.location.id = dev;
+    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
+
+    size_t granularity = 0;
+    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
+              CU_MEM_ALLOC_GRANULARITY_MINIMUM));
+    fprintf(stderr, "producer: granularity=%zu\n", granularity);
+
+    size_t size = ((POOL_SIZE + granularity - 1) / granularity) * granularity;
+    fprintf(stderr, "producer: alloc size=%zu\n", size);
+
+    CUmemGenericAllocationHandle mem;
+    CHECK(cuMemCreate(&mem, size, &prop, 0));
+
+    CUdeviceptr ptr;
+    CHECK(cuMemAddressReserve(&ptr, size, 0, 0, 0));
+    CHECK(cuMemMap(ptr, size, 0, mem, 0));
+
+    CUmemAccessDesc access = {0};
+    access.location = prop.location;
+    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
+    CHECK(cuMemSetAccess(ptr, size, &access, 1));
+
+    /* Fill with MAGIC pattern */
+    CHECK(cuMemsetD8(ptr, MAGIC_BYTE, size));
+    CHECK(cuCtxSynchronize());
+
+    int fd;
+    CHECK(cuMemExportToShareableHandle(&fd, mem,
+              CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0));
+    fprintf(stderr, "producer: exported fd=%d for handle\n", fd);
+
+    /* Unix socket server */
+    unlink(SOCK_PATH);
+    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
+    if (srv < 0) { perror("socket"); return 1; }
+    struct sockaddr_un sa = {.sun_family = AF_UNIX};
+    strncpy(sa.sun_path, SOCK_PATH, sizeof(sa.sun_path) - 1);
+    if (bind(srv, (struct sockaddr *)&sa, sizeof(sa)) < 0) { perror("bind"); return 1; }
+    if (listen(srv, 1) < 0) { perror("listen"); return 1; }
+
+    fprintf(stderr, "producer: listening on %s, awaiting consumer...\n", SOCK_PATH);
+    int cli = accept(srv, NULL, NULL);
+    if (cli < 0) { perror("accept"); return 1; }
+
+    if (send_fd(cli, fd, (uint64_t)size, MAGIC_BYTE) < 0) return 1;
+    fprintf(stderr, "producer: sent fd + size=%zu + magic=0x%02x\n",
+            size, MAGIC_BYTE);
+
+    /* Wait for consumer ACK */
+    char ack;
+    if (read(cli, &ack, 1) != 1) { perror("read ack"); return 1; }
+    fprintf(stderr, "producer: got ACK=0x%02x\n", (unsigned char)ack);
+
+    close(cli);
+    close(srv);
+    unlink(SOCK_PATH);
+    close(fd);
+
+    CHECK(cuMemUnmap(ptr, size));
+    CHECK(cuMemAddressFree(ptr, size));
+    CHECK(cuMemRelease(mem));
+    CHECK(cuCtxDestroy(ctx));
+
+    fprintf(stderr, "producer: done\n");
+    return ack == 'O' ? 0 : 1;
+}
@@ -61,6 +61,8 @@ struct Args {
    bool realtime = false;  // emulate -re у ffmpeg CLI: sleep по pts
    bool loop = false;      // loop input на eof (для file://)
    bool enable_packet_ring = false;  // v0.2 — публиковать encoded packets
+    std::string policy = "drop";  // "drop" = DROP_OLDEST, "wait" = STRICT_WAIT
+    int ack_timeout_ms = 200;     // only used при policy=wait; <=0 = infinite (unsafe)
 };

 static void print_usage() {
@@ -78,6 +80,14 @@ static void print_usage() {
        "  --loop              loop input на EOF (только для file://)\n"
        "  --enable-packet-ring  v0.2: дополнительно публиковать encoded packets\n"
        "                       (для consumer'ов с -c:v copy, Frigate record path)\n"
+        "  --policy MODE        drop (default) = DROP_OLDEST — producer wrap'ает ring\n"
+        "                       без ожидания consumer ack. Подходит для multi-consumer.\n"
+        "                       wait = STRICT_WAIT — producer ждёт ack от всех subscribers\n"
+        "                       перед overwrite. Безопаснее для frame integrity, но slow\n"
+        "                       consumer задерживает all (default ack-timeout 200ms).\n"
+        "  --ack-timeout-ms N   только при --policy wait. Max wait для ack (default 200).\n"
+        "                       <=0 = infinite — НЕ РЕКОМЕНДУЕТСЯ (dead consumer вешает\n"
+        "                       producer навсегда).\n"
        "  --verbose           debug logs\n"
        "  -h, --help          this help\n";
 }
@@ -96,11 +106,23 @@ static int parse_args(int argc, char **argv, Args &a) {
        else if (s == "--realtime") a.realtime = true;
        else if (s == "--loop") a.loop = true;
        else if (s == "--enable-packet-ring") a.enable_packet_ring = true;
+        else if (s == "--policy") a.policy = next();
+        else if (s == "--ack-timeout-ms") a.ack_timeout_ms = std::stoi(next());
        else if (s == "--verbose") a.verbose = true;
        else if (s == "-h" || s == "--help") { print_usage(); std::exit(0); }
        else { std::cerr << "Unknown arg: " << s << "\n"; print_usage(); std::exit(1); }
    }
    if (a.rtsp_url.empty() || a.key.empty()) { print_usage(); return 1; }
+    if (a.policy != "drop" && a.policy != "wait") {
+        std::cerr << "Invalid --policy '" << a.policy << "' (use drop|wait)\n";
+        return 1;
+    }
+    if (a.policy == "wait" && a.ack_timeout_ms <= 0) {
+        std::cerr << "WARNING: --policy wait + --ack-timeout-ms<=0 = infinite wait.\n"
+                  << "  Dead consumer повесит producer навсегда. Forcing к 200ms.\n"
+                  << "  Set явно --ack-timeout-ms 200 (или больше) чтобы убрать warning.\n";
+        a.ack_timeout_ms = 200;
+    }
    return 0;
 }

@@ -209,35 +231,33 @@ int main(int argc, char **argv) {
        return 2;
    }

-    /* Pre-allocate cuframes pool (NV12 — что nvdec выдаёт) */
+    /* Pre-allocate cuframes pool (NV12 — что nvdec выдаёт).
+     * v0.4: publisher сам аллоцирует через cuMemCreate (VMM). Раньше tool
+     * передавал external pool, но v0.4 не может export'нуть cudaMalloc-pointers
+     * как POSIX FD — VMM API требует cuMemCreate-allocated memory. */
    int32_t pitch_y = 0, pitch_uv = 0;
    size_t frame_size = cuframes::calc_frame_size(CUFRAMES_FORMAT_NV12,
                                                    width, height,
                                                    &pitch_y, &pitch_uv);

    cudaSetDevice(a.cuda_device);
-    std::vector<void *> pool(a.ring_size, nullptr);
-    for (int i = 0; i < a.ring_size; ++i) {
-        cudaError_t cerr = cudaMalloc(&pool[i], frame_size);
-        if (cerr != cudaSuccess) {
-            std::cerr << "cudaMalloc pool[" << i << "]: " << cudaGetErrorString(cerr) << "\n";
-            return 2;
-        }
-    }

    cuframes::PublisherOptions po;
    po.key = a.key;
    po.width = width;
    po.height = height;
    po.format = CUFRAMES_FORMAT_NV12;
-    po.policy = CUFRAMES_POLICY_DROP_OLDEST;
+    po.policy = (a.policy == "wait")
+        ? CUFRAMES_POLICY_STRICT_WAIT
+        : CUFRAMES_POLICY_DROP_OLDEST;
+    po.consumer_ack_timeout_ms = a.ack_timeout_ms;
    po.cuda_device = a.cuda_device;
-    po.ring_size = a.ring_size;  /* для logging */
+    po.ring_size = a.ring_size;

-    cuframes::Publisher pub(po, pool.data(), a.ring_size, frame_size);
+    cuframes::Publisher pub(po);   /* LIBRARY ownership — publisher owns VMM pool */
    std::cerr << "[cuframes-src] publisher 'cuframes-" << a.key
-              << "' ready, ring=" << a.ring_size
-              << " pool_size=" << frame_size << " bytes/frame\n";
+              << "' ready (v0.4 VMM), ring=" << a.ring_size
+              << " frame_size=" << frame_size << " bytes\n";

    /* v0.2 — encoded packet ring (опционально). */
    if (a.enable_packet_ring) {
@@ -268,7 +288,6 @@ int main(int argc, char **argv) {
    AVFrame *frame = av_frame_alloc();
    if (!pkt || !frame) return 2;

-    int pool_idx = 0;
    uint64_t frame_count = 0;
    auto t_last_log = std::chrono::steady_clock::now();
    uint64_t last_log_count = 0;
@@ -350,7 +369,15 @@ int main(int argc, char **argv) {
            int src_pitch_y = frame->linesize[0];
            int src_pitch_uv = frame->linesize[1];

-            void *dst = pool[pool_idx];
+            /* v0.4: acquire slot из publisher's VMM pool */
+            void *dst = nullptr;
+            try {
+                dst = pub.acquire();
+            } catch (const cuframes::Error &e) {
+                std::cerr << "acquire: " << e.what() << "\n";
+                av_frame_unref(frame);
+                continue;
+            }

            /* D2D 2D-copy Y plane */
            cudaError_t cerr = cudaMemcpy2DAsync(
@@ -388,14 +415,13 @@ int main(int argc, char **argv) {

            int64_t pts_ns = cuframes::now_ns();
            try {
-                pub.publish_external(dst, stream, pts_ns);
+                pub.publish(stream, pts_ns);
            } catch (const cuframes::Error &e) {
-                std::cerr << "publish_external: " << e.what() << "\n";
+                std::cerr << "publish: " << e.what() << "\n";
                av_frame_unref(frame);
                continue;
            }

-            pool_idx = (pool_idx + 1) % a.ring_size;
            frame_count++;
            av_frame_unref(frame);

@@ -422,9 +448,7 @@ int main(int argc, char **argv) {
    av_buffer_unref(&hw_device);

    cudaStreamDestroy(stream);
-    /* Publisher destructor freed first; теперь освободим pool */
-    /* Note: publisher уже destroyed by RAII, IPC handles closed by subscribers */
-    for (auto p : pool) if (p) cudaFree(p);
+    /* v0.4: publisher owns VMM pool — destructor освободит cuMemRelease etc. */

    return 0;
 }
Author	SHA1	Message	Date
gx	7f4bdfcaab	Merge pull request 'Python bindings (pybind11) — Phase 0 v1' (#7 ) from feat/python-bindings into main build / cmake build (CUDA 12.4, Ubuntu 22.04) (push) Failing after 39s Details build / ffmpeg filter patch (out-of-tree) (push) Has been skipped Details	2026-06-13 21:34:29 +01:00
gx	afc2dd7fff	python: DLPack + health stats + CUDA stream + docs (tasks #199-#202) build / cmake build (CUDA 12.4, Ubuntu 22.04) (pull_request) Failing after 1m50s Details build / ffmpeg filter patch (out-of-tree) (pull_request) Has been skipped Details #199 DLPack export: - frame.dlpack_y() / .dlpack_uv() — explicit multi-plane access для NV12 - frame.__dlpack__() / __dlpack_device__() — protocol для torch/cupy - Capsule deleter правильно держит refcount на frame_keep_alive, releases shape/strides arrays. CUDA pointer принадлежит frame. #200 Health/stats counters: - frames_received, timeouts, errors — per-call counters - last_seq, gap_count — proxy для drop count (NEWEST_ONLY mode) - last_frame_pts_ns - stats() — snapshot dict для MQTT health publish - counted в pybind layer т.к. C API не expose'ит ring_occupancy #201 Per-subscriber CUDA stream + thread-safety: - consumer_stream kwarg в subscribe() — int (cudaStream_t pointer) - subscriber.consumer_stream property - Thread-safety contract в docstring CuframesSubscriber - next_frame() передаёт consumer_stream_ в cuframes_subscriber_next #202 Smoke test + docs: - 10/10 pytest passed (расширен +2 теста на consumer_stream) - docs/python.md (~250 строк): quick start, API reference, integration с PyTorch/CuPy, reconnect-loop pattern, per-stream usage, pitch alignment, thread-safety, error taxonomy, backpressure, Phase 0 limitations Verify build + tests: cmake -B build-python -DBUILD_PYTHON_BINDINGS=ON cmake --build build-python -j pytest python/tests/ -v # 10/10 Закрывает Phase 0 issue gx/cuframes#6. Разблокирует goldix-smart-home/yolo-world-detector Phase 1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-13 21:33:21 +01:00
gx	5d1eaedb38	python: CuframesSubscriber + CuframesFrame wrapper (task #198 ) Реализует subscriber-side wrapper над cuframes_subscriber_* и cuframes_frame_* C API. Что добавлено: - CuframesFrame — owning RAII wrapper над cuframes_frame_t* - properties: cuda_ptr, format, width, height, pitch_y, pitch_uv, seq, pts_ns, released - release() idempotent - context manager (__enter__/__exit__) — release при выходе - после release() property access бросает CuframesError - CuframesSubscriber — owning RAII wrapper над cuframes_subscriber_t* - конструктор с key/consumer_name/mode/cuda_device/connect_timeout_ms - next_frame(timeout_ms) → CuframesFrame - close() idempotent - context manager - GIL released на блокирующих вызовах (create, next_frame) - subscribe() — module-level factory shortcut Архитектурные решения: - GIL release в py::gil_scoped_release на subscriber_create и _next — чтобы другие Python потоки могли работать пока ждём frame - consumer_stream передаётся как nullptr в Phase 0 (default stream); per-subscriber stream в task #201 - Frame держит raw pointer на subscriber, refcount Python-стороной; если subscriber уничтожен раньше, frame.release() становится no-op Smoke tests расширены до 8 — добавлены проверки exposed API и error mapping на subscribe к несуществующему publisher'у. Verify: pytest tests/test_smoke.py — 8/8 passed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-13 21:23:42 +01:00
gx	7b6d43efeb	python: fix exception hierarchy — не вызывать .attr("__class__") py::exception<T>(...) уже возвращает Python class object. Дополнительный .attr("__class__") давал metaclass (type), из-за чего issubclass() проверка для всех subexc возвращала False. Verify: pytest tests/test_smoke.py — 5/5 passed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-13 21:19:03 +01:00
gx	a7da4ea728	python: skeleton pybind11 bindings (issue #6 task #197 ) Каркас Python-пакета `cuframes`: - python/pyproject.toml — scikit-build-core конфиг - python/CMakeLists.txt — pybind11 module через FetchContent - python/src/_native.cpp — module entry, error таксономия, enum mirrors (PixelFormat, SubscriberMode), version - python/cuframes/__init__.py — re-export публичного API - python/tests/test_smoke.py — smoke tests без real subscribe - python/README.md — статус + build instructions - CMakeLists.txt — подключение python/ при BUILD_PYTHON_BINDINGS=ON Реальный subscriber/frame wrapper в следующих коммитах (tasks #198-#202). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-13 12:59:04 +01:00
gx	655649f4d8	cmake: использовать PROJECT_SOURCE_DIR вместо CMAKE_SOURCE_DIR build / cmake build (CUDA 12.4, Ubuntu 22.04) (pull_request) Failing after 5m19s Details build / ffmpeg filter patch (out-of-tree) (pull_request) Has been skipped Details build / cmake build (CUDA 12.4, Ubuntu 22.04) (push) Failing after 4m14s Details build / ffmpeg filter patch (out-of-tree) (push) Has been skipped Details При сборке cuframes как подпроекта родительского CMake-проекта (add_subdirectory) CMAKE_SOURCE_DIR указывает на корень родителя, а не cuframes. Из-за этого target_include_directories cuframes получал неверный путь и компиляция падала с fatal error: cuframes/cuframes.h: No such file or directory PROJECT_SOURCE_DIR резолвится в каталог project(), то есть всегда указывает на корень cuframes независимо от способа подключения. Standalone-сборка ведёт себя как раньше — оба пути одинаковы.	2026-06-03 04:27:24 +01:00
Claude Opus	78824c4ed1	docker: +mosquitto-clients в runtime image build / cmake build (CUDA 12.4, Ubuntu 22.04) (push) Successful in 1m42s Details build / ffmpeg filter patch (out-of-tree) (push) Failing after 1m22s Details Нужен для loop-publisher.sh wrapper в cctv stack — heartbeat и alert MQTT publish. 4.5 MB добавил, runtime image теперь ~590 MB. Без него wrapper silent fail на mqtt_alert/mqtt_state (но retry-loop работает). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-02 17:59:56 +01:00
gx	4862247fe2	v0.4: VMM + POSIX FD — namespace decoupling (no pid share required) build / cmake build (CUDA 12.4, Ubuntu 22.04) (push) Successful in 1m46s Details build / ffmpeg filter patch (out-of-tree) (push) Failing after 1m30s Details Заменяет cudaMalloc + cudaIpcGetMemHandle на cuMemCreate (VMM) + cuMemExportToShareableHandle(POSIX_FILE_DESCRIPTOR). FDs передаются consumer'у через sendmsg(SCM_RIGHTS) в handshake. Frigate (s6-overlay не даёт share PID) и любой другой consumer работают БЕЗ pid namespace share — только volume mount unix socket'a /run/cuframes и IPC share для /dev/shm header. Sync: cudaEventRecord+IPC events → cuStreamSynchronize в do_publish. Producer ждёт ~1 ms что stream flush'нулся, потом atomic_store(seq). Consumer читает seq через memory_order_acquire и копирует DtoD без event wait — HW coherence гарантирована на одном GPU. ABI break (согласован с user'ом): - magic 0xCC7C1DCC → 0xCC7C1DCE (старые consumers fail cleanly) - protocol V3 → V4 - libcuframes.so.0 SOVERSION остаётся, но .so.0.3.0 → .so.0.4.0 - EXTERNAL ownership убран (VMM требует cuMemCreate-allocated memory, нельзя export'нуть произвольный cudaMalloc-pointer как POSIX FD) - cuframes-rtsp-source переведён на LIBRARY mode + один D2D memcpy в acquire'нутый slot (overhead малый — публишер всё равно делал такой D2D из FFmpeg hwframe pool в EXTERNAL pool раньше) Размер: granularity 2 MB на 5090 → NV12 1920×1080 (~3.1 MB) округляется до 4 MB, +1 MB на slot × 16 × 4 камеры = +64 MB VRAM. Терпимо. Packet ring (cuframes_packets://) НЕ затронут — отдельный SHM с своим magic, работает как раньше. PoC + smoke в spike/: - vmm_fd_pingpong/ — minimal cuMemCreate+FD round-trip - smoke_v04/ — full publisher+subscriber, 100/100 frames без pid share Base image: Dockerfile.runtime → CUDA 12.4 (был 13.0). Matching prod pipeline + Frigate base, иначе libcudart conflict при load. Compose stack (localhost-infra repo) — параллельный commit: - убран pid: container:cuframes-pub-parking из subscribers - image теги: gx/cuframes:0.4, gx/cuda-grid-pipeline:phase8, gx/frigate:cuframes-v0.4 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 20:13:31 +01:00
gx	d646f5a4e4	v0.3.3: consumer post-sync verify даже для v0.3 per-slot events release / build runtime Docker image (push) Failing after 0s Details release / build source tarball (push) Successful in 4s Details build / cmake build (CUDA 12.4, Ubuntu 22.04) (push) Successful in 1m41s Details build / ffmpeg filter patch (out-of-tree) (push) Successful in 1m29s Details test-u4-runner / u4 runner smoke test (push) Has been cancelled Details Bug: cudaEventRecord(event[slot]) overwrites previous state каждый publish. Когда producer wraps ring (~640ms при ring=16), event[slot] re-recorded для new content. Consumer's pending cudaStreamWaitEvent satisfied новым signal — consumer reads slot[slot_idx] thinking it's target_seq, реально получает seq+ring_size content (stale-by-1-wrap drift). После 50k+ wraps в long-running pipeline (9h uptime) drift накапливается: output stream имеет 60-70% duplicate frames (vs 10% сразу после restart). Симптом: TV picture freezes на 1-2 sec периодически. Encoder fps=25 stable (content duplicates same PTS-advance), но motion choppy на 8-9 fps real. Fix: unconditional post-sync verify (atomic re-read slot.seq после event wait). Если producer wrap occurred — slot.seq != target_seq → continue к новому target_seq. Cheap (one atomic load), correctness > perf. Verified: после deploy с fresh pipeline, 18-sec sample = 4% duplicates (vs 8.4% при том же setup но без fix). Proper v0.4 fix: per-slot+per-publish event pool с unique handle per cycle. Текущий v0.3.3 — sufficient mitigation для current production scale. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 20:27:00 +01:00
gx	becfbebc78	cuframes-rtsp-source: + --policy + --ack-timeout-ms CLI flags release / build runtime Docker image (push) Failing after 0s Details release / build source tarball (push) Successful in 2s Details build / cmake build (CUDA 12.4, Ubuntu 22.04) (push) Successful in 1m39s Details build / ffmpeg filter patch (out-of-tree) (push) Successful in 1m25s Details test-u4-runner / u4 runner smoke test (push) Has been cancelled Details Opt-in для STRICT_WAIT policy (default остаётся DROP_OLDEST). Use case STRICT_WAIT: Frame integrity критичен (e.g. recording, frame-accurate analytics). Producer ждёт ack от всех subscribers перед wrap ring → no torn frames. Trade-off: slow consumer задерживает all (default 200ms timeout затем subscriber dropped from bitmap). Use case DROP_OLDEST (default): Low-latency real-time display (TV grid). Producer wraps freely; v0.3 per-slot CUDA events закрывают race без waiting. Validation: policy=wait + ack-timeout-ms<=0 = infinite hold dead consumer — warning + force к 200ms safe default. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 08:47:14 +01:00
gx	656e36e9b0	v0.3.1: per-subscriber monitor thread — fix bitmap leak release / build runtime Docker image (push) Failing after 0s Details release / build source tarball (push) Successful in 4s Details build / cmake build (CUDA 12.4, Ubuntu 22.04) (push) Successful in 1m39s Details build / ffmpeg filter patch (out-of-tree) (push) Successful in 1m32s Details test-u4-runner / u4 runner smoke test (push) Has been cancelled Details Bug: handshake_subscriber assigned bit + activated slot но НЕ tracked client_fd. Когда subscriber container exited, socket closed on client side но producer не detected → bit оставался set forever → после 32 connections subscribe_create('cam-X'): too many subscribers (max 32). Симптом в production: каждый pipeline recreate accumulated 1 stale subscriber. После 4-5 recreate операций publishers перестали accept new pipeline → "too many subscribers" crash loop. Fix: после успешного handshake spawn detached pthread monitoring socket via blocking recv(). recv() returns 0 (EOF) когда other side closes — monitor clears bit (subscriber_bitmap &= ~(1<<bit)) + state[bit] = 0, closes fd, exits. Cost: 1 thread per active subscriber. Max 32 threads — небольшой overhead. Threads detached, no join needed. Stress test: 5x pipeline recreate без single "too many subscribers" error. Раньше: 2-3 recreate → bitmap overflow. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-24 08:00:41 +01:00
gx	8c7abbc4e8	v0.3: per-slot CUDA events — закрывает TOCTOU race без crutches release / build runtime Docker image (push) Failing after 1s Details release / build source tarball (push) Successful in 5s Details build / cmake build (CUDA 12.4, Ubuntu 22.04) (push) Successful in 1m40s Details build / ffmpeg filter patch (out-of-tree) (push) Successful in 1m22s Details test-u4-runner / u4 runner smoke test (push) Has been cancelled Details Protocol bump V2→V3: + shm header: cudaIpcEventHandle_t slot_event_handles[CUFRAMES_MAX_RING] + producer creates ring_size events (вместо одного global) + producer.do_publish records event[slot] (вместо pub->event) + consumer opens all slot events при subscribe + consumer waits event[slot_idx] specifically (вместо global producer_event) Backward compat: - Legacy pub->event сохранён + ipc_event_handle export'ится — v0.2 consumers видят его и работают по-старому (с post-sync verify hack из `517107d`). - v0.3 consumer auto-detects proto_version >= 3, fallback к legacy если cudaIpcOpenEventHandle на slot fail (graceful degradation). Effect (15-sec sample на Phase 7 single-cam, motion): v0.1 production: dup runs 34.7%, max 14 frames (560ms freeze) v0.2.1 fix: dup runs 10%, max 6, 0 back-jumps detected v0.3 per-slot: dup runs 1.9%, max 5, 3 back-jumps (likely encoder static-content artifacts, not real race) Размер shm header: 7424 → 8448 bytes (+1024 для slot_event_handles). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-22 09:23:53 +01:00
gx	517107d741	libcuframes: fix TOCTOU race в consumer slot read build / cmake build (CUDA 12.4, Ubuntu 22.04) (push) Successful in 1m34s Details build / ffmpeg filter patch (out-of-tree) (push) Successful in 1m19s Details release / build runtime Docker image (push) Failing after 1s Details release / build source tarball (push) Successful in 4s Details test-u4-runner / u4 runner smoke test (push) Has been cancelled Details Bug: producer signals один global cudaEvent для всего ring (один на producer). Consumer waits этот event после slot_seq validation, но event соответствует ПОСЛЕДНЕМУ published frame, не slot[target_seq]. Если producer wrap'нет ring во время event wait (ring=6 = 240ms окно), slot содержит уже next-gen data, consumer возвращает torn/stale frame. Симптом в production: video stream показывает «back-jump на момент» periodically — camera OSD timestamp дёргается, motion machines briefly teleport назад. cluster md5 analysis НЕ ловит (содержимое frames всё ещё unique, просто из неправильной epoch). Fix: post-sync verify. После cudaStreamWaitEvent / cudaEventSynchronize re-check slots[slot_idx].seq == target_seq. Если producer перезаписал — continue outer loop с новым target_seq. Закрывает race window между slot validation и event sync return. Остаются открытыми: - downstream GPU access после frame fill (consumer-side) — producer может wrap во время этого. Mitigation: STRICT_WAIT policy в publisher + ack discipline в consumer (cuframes_release_frame ack уже works). - bigger ring size снижает wrap frequency (240ms → 1.2s при ring=30). Test: после deploy в cuda-grid-pipeline (Phase 7 single cam), camera OSD clock больше не дёргается (раньше дёргалось каждые ~16 sec). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-21 22:27:39 +01:00
gx	4d54173bb2	roadmap: vf_cuda_grid выделен в отдельный продукт gx/vf-cuda-grid	2026-05-19 20:39:47 +01:00
gx	52fb2ad722	benchmarks: actual measured VRAM + network bandwidth (tcpdump-based) VRAM breakdown (nvidia-smi pmon): - 4 publishers = 4.4 GB (FHD + 2688x1520 ring buffers + NVDEC) - cctv-backend = 1.0 GB - frigate embeddings_manager = 1.6 GB - frigate detector:onnx = 0.6 GB - Total cuframes-stack = ~7.7 GB Network (10-sec tcpdump capture от camera subnet к R9): - Measured: 31.5 Mbps (всё включая go2rtc on-demand, ONVIF) - cuframes core: ~16 Mbps (4 publishers × main HEVC) - ONVIF/RTSP keepalives: ~1-2 Mbps - Без cuframes setup тех же 4 cam × 3 consumer был бы ~45-50 Mbps Source: production deploy 2026-05-19 measurement.	2026-05-19 19:22:53 +01:00
gx	3779175737	docs(benchmarks): production v0.2 deploy metrics (4 cam × 3 consumer) Real-world numbers с production deploy 2026-05-19: - RTSP к камерам: 12 → 4 (−67%) - NVDEC sessions: 8 → 4 (−50%) - Camera bandwidth: 34 → 16 Mbps (−54%) - PCIe D2H copies: 346 MB/s → ~0 (−100% через zero-copy CUDA IPC) - Frigate прямые RTSP: 8 → 0 (−100%) Plus live nvidia-smi metrics, что сохранилось vs не сэкономлено, projection table для других setup'ов (8/16 cam × 2/3/4 consumer). Для promotional material — public-facing claims на основе measured deploy.	2026-05-19 19:07:16 +01:00