Resilience: the loss of any input must not break the output #3
Context
Today (2026-05-21) we found that the audio sidecar lost its connection to mediamtx (the ffmpeg RTSP output has no auto-reconnect). The pipeline depends on live-audio → 404 → restart loop → the entire output is lost.
The same applies to video: if a single cuframes input gets EOF (the publisher container crashed or the RTSP camera went down), the pipeline's ffmpeg cuframes:// demuxer returns EOF → the whole pipeline hits EOF → the output goes dark.
This is unacceptable for production. It should work like this:
stream.lost/<cam> in MQTT. On recovery: stream.restored/<cam> + the cell returns to normal.
Architecture options
A. Separate ffmpeg per camera + composer
4 ffmpeg processes (one per camera): cuframes/RTSP → local UDP MPEGTS. A composer ffmpeg reads the 4 UDP streams and composes them. A Docker watchdog restarts each camera independently.
lavfi color=black on UDP timeout
Cons: the additional transcoding round-trip kills the zero-copy advantage of cuframes.
B. Watchdog + auto-reconnect in the current pipeline
The controller monitors mediamtx paths and cuframes publishers. On a detected stream loss → a ZMQ command tells the filter to show a placeholder + an MQTT event is emitted. On restore → the placeholder is removed.
The pipeline's ffmpeg cuframes demuxer should support reconnect (currently not sure it does). Without demuxer reconnect, a pipeline restart is needed.
Cons: the cuframes demuxer does not reconnect natively. A pipeline restart cycle is still needed for some failures.
C. Filter-level resilience (proper)
The vf_cuda_grid filter itself detects "no frame from pad N" → renders the cell with the no_signal_<cam>.png placeholder + emits an AVFrame side data event. Each input pad has a state (alive/dead).
Demuxer change: cuframes:// itself produces "stale frame" frames on IPC timeout (for example, the last received frame plus a "stale 5s ago" watermark). The pipeline never EOFs.
Similarly for the audio sidecar: amix with lavfi anullsrc as an always-running fallback input. If the real radio dies → silent fallback, and the sidecar does not crash.
The controller monitors stream states (via MQTT events from filter side data) → toggles icon overlays.
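The audio fallback idea (amix fed by a real source plus an always-running anullsrc) could be expressed as an ffmpeg argument list along these lines. This is a minimal sketch, not the sidecar's actual command: the function name, the URLs in the usage note, the AAC codec choice, and the RTSP output are assumptions for illustration, and whether amix keeps running when the real input fails with an error (rather than a clean EOF) still has to be verified on the target ffmpeg build.

```python
from typing import List


def sidecar_audio_args(radio_url: str, output_url: str,
                       sample_rate: int = 48000) -> List[str]:
    """Build ffmpeg arguments that mix a live source with endless silence.

    Hypothetical helper: the codec and output muxer are illustrative choices.
    """
    # lavfi source that generates silence forever
    silence = f"anullsrc=channel_layout=stereo:sample_rate={sample_rate}"
    # duration=longest keeps the mix alive as long as any input is alive,
    # and the silent input never ends
    mix = "[0:a][1:a]amix=inputs=2:duration=longest[out]"
    return [
        "ffmpeg",
        "-i", radio_url,               # real audio source (may disappear)
        "-f", "lavfi", "-i", silence,  # always-on silent fallback
        "-filter_complex", mix,
        "-map", "[out]",
        "-c:a", "aac",
        "-f", "rtsp", output_url,
    ]
```

For example, `sidecar_audio_args("rtsp://radio.example/stream", "rtsp://127.0.0.1:8554/live-audio")` returns a list that can be passed to `subprocess.Popen`; both URLs are placeholders.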
D. Hybrid (recommended)
Combine B (audio sidecar with lavfi fallback) + partial C (the filter knows about EOF and renders a placeholder, but does not require a demuxer rework; it uses EXT_INFINITY framesync behaviour + Y/UV plane fill).
framesync already supports EXT_INFINITY, which repeats the last frame indefinitely. If an input dies, the filter keeps repeating its last frame. Not ideal (frozen video instead of "NO SIGNAL"), but the output continues.
Improvement: the filter detects "frame timestamp not advancing for > 5 sec" → switches to placeholder rendering (a built-in NO SIGNAL banner over the frozen frame).
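The stale-frame check could work roughly as follows. The real logic would live in the filter's C code; this Python sketch only illustrates the per-pad state, and `PadState`, `observe`, and the use of a wall-clock `now` argument are hypothetical names and choices, not existing filter API.

```python
from dataclasses import dataclass
from typing import Optional

STALE_AFTER_SEC = 5.0  # "not advancing > 5 sec" threshold from the text


@dataclass
class PadState:
    """Tracks whether one input pad is still delivering fresh frames."""
    last_pts: Optional[int] = None
    last_change: float = 0.0  # wall-clock time when the PTS last advanced
    stale: bool = False

    def observe(self, pts: int, now: float) -> bool:
        """Record the PTS seen at time `now`.

        Returns True when the placeholder should be drawn for this pad.
        """
        if self.last_pts is None or pts > self.last_pts:
            # A new frame arrived: the pad is alive again.
            self.last_pts = pts
            self.last_change = now
            self.stale = False
        elif now - self.last_change > STALE_AFTER_SEC:
            # framesync keeps repeating the same frame: treat it as lost.
            self.stale = True
        return self.stale
```

Calling `observe` once per output frame for every pad gives the filter a simple alive/stale flag. The same flag could drive both the NO SIGNAL banner and the side data event.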
Audio: sidecar amix=duration=longest + anullsrc as an always-on input. Real radio drops? amix continues with silence.
Quick wins (applicable regardless of the option chosen)
1. Audio sidecar: lavfi anullsrc as an amix input (silence fallback). If the real audio disappears, the output carries silence and the pipeline does not EOF.
2. Pipeline ffmpeg input reconnect flags for the RTSP loopback from mediamtx:
3. Controller watchdog: periodically check mediamtx /v3/paths/list (or ffprobe). If the live-audio path has been missing for > 10 sec → emit an MQTT event + dispatch the "audio offline" icon overlay.
4. MQTT events for stream state:
cuda_grid/event/<inst>/stream_lost { cam: "gate_lpr" }
cuda_grid/event/<inst>/stream_restored { cam: "gate_lpr" }
Recommendation
Phase 1 (immediate, no filter dev): quick wins 1-4, implementable on the controller + audio sidecar alone. Covers ~80% of production failures (network glitches, radio drops, mediamtx hiccups).
Phase 2 (medium-term): Option D, where the filter detects stale frames + renders a placeholder. Requires filter development in parallel with the #2 layout redesign.
Phase 3 (long-term, before production): full Option C, a demuxer rework for true resilience.
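For Phase 1, the controller watchdog and the stream-state events could be sketched as below. This is an illustration under assumptions, not the controller's code: `ready_paths` assumes the /v3/paths/list response is a JSON object with an `items` array whose entries carry `name` and `ready` fields, and `StreamWatchdog` is a hypothetical class. HTTP polling and the actual MQTT publish are left out, so the sketch only decides which (topic, payload) pairs to emit.

```python
import json
from typing import Dict, Iterable, List, Optional, Set, Tuple

LOSS_GRACE_SEC = 10.0  # "path missing > 10 sec" from the watchdog quick win


def ready_paths(paths_list_response: dict) -> Set[str]:
    """Names of paths marked ready (assumed response shape, see lead-in)."""
    return {
        item["name"]
        for item in paths_list_response.get("items", [])
        if item.get("ready")
    }


class StreamWatchdog:
    """Debounces path loss and produces stream_lost / stream_restored events."""

    def __init__(self, instance: str, watched: Iterable[str]):
        self.instance = instance
        self.missing_since: Dict[str, Optional[float]] = {n: None for n in watched}
        self.lost: Set[str] = set()

    def _event(self, kind: str, cam: str) -> Tuple[str, str]:
        # Topic layout follows the MQTT events listed in the quick wins.
        return f"cuda_grid/event/{self.instance}/{kind}", json.dumps({"cam": cam})

    def poll(self, ready: Set[str], now: float) -> List[Tuple[str, str]]:
        """Feed one polling result; return the events to publish."""
        events = []
        for cam in self.missing_since:
            if cam in ready:
                self.missing_since[cam] = None
                if cam in self.lost:
                    self.lost.discard(cam)
                    events.append(self._event("stream_restored", cam))
            elif self.missing_since[cam] is None:
                self.missing_since[cam] = now  # first time seen missing
            elif (now - self.missing_since[cam] > LOSS_GRACE_SEC
                  and cam not in self.lost):
                self.lost.add(cam)
                events.append(self._event("stream_lost", cam))
        return events
```

The grace period keeps short mediamtx hiccups from producing event noise. Each returned pair can be handed to whatever MQTT client the controller already uses.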
Related