Resilience: losing any input must not break the output #3

Open
opened 2026-05-21 09:54:28 +01:00 by gx · 0 comments
Owner

Context

Today (2026-05-21) we found that the audio sidecar lost its connection to mediamtx (its ffmpeg RTSP output has no auto-reconnect). The pipeline depends on live-audio, which started returning 404, so the pipeline went into a restart loop and the entire output was lost.

The same applies to video: if a single cuframes input hits EOF (the publisher container crashed or the RTSP camera went down), the pipeline's ffmpeg cuframes:// demuxer returns EOF, the whole pipeline ends with EOF, and the output goes down.

This is unacceptable for production. The required behaviour:

  • One video stream goes down → its cell shows a "NO SIGNAL" placeholder and a stream.lost/<cam> event is published to MQTT. On recovery: stream.restored/<cam>, and the cell returns to normal.
  • The audio stream goes down → icon overlay "🔇 audio unavailable" + event. On recovery the icon is removed.
  • The output stream to the TV is never interrupted (it runs continuously, with placeholders where needed).

Architecture options

A. Separate ffmpeg per camera + composer

4 ffmpeg processes (one per camera): cuframes/RTSP → local UDP MPEG-TS. A composer ffmpeg reads the 4 UDP streams and composes them. A Docker watchdog restarts each camera process independently.

Resilience: a per-camera ffmpeg crash stops only that camera's UDP stream; the composer keeps running
Placeholder logic: composer falls back to lavfi color=black on UDP timeout
GPU cost: 4× extra decode-encode (cuframes → MPEG-TS → composer decode)
Complexity: Medium (4 services + composer)

Cons: the extra transcoding round-trip kills the zero-copy advantage of cuframes.

B. Watchdog + auto-reconnect in the current pipeline

The controller monitors mediamtx paths and cuframes publishers. On a detected stream loss it sends a ZMQ command telling the filter to show a placeholder and emits an MQTT event. On restore it removes the placeholder.

The pipeline's ffmpeg cuframes demuxer would need to support reconnect (whether it currently does has not been verified). Without demuxer reconnect, a pipeline restart is required.

Resilience: Limited; without demuxer reconnect a pipeline restart is still needed
Placeholder logic: Controller-driven (icon overlays)
GPU cost: 0× extra
Complexity: Medium (controller monitoring)

Cons: the cuframes demuxer does not reconnect natively, so a pipeline restart cycle is still needed for some failures.

C. Filter-level resilience (proper)

The vf_cuda_grid filter itself detects "no frame from pad N", renders the cell with the no_signal_<cam>.png placeholder, and emits an event as AVFrame side data. Each input pad has a state (alive/dead).

Demuxer change: cuframes:// itself produces "stale frame" frames on IPC timeout (for example, the last received frame plus a "stale 5s ago" watermark). The pipeline never EOFs.
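The stale-frame idea can be sketched outside ffmpeg as a reader that never reports EOF. This is only an illustration in Python, not the cuframes demuxer: the `Frame` type, the `StaleFrameReader` name, and the use of a `queue.Queue` as a stand-in for the IPC channel are all assumptions made for the example.

```python
import queue
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    data: bytes             # stand-in for the real GPU frame handle (hypothetical)
    pts: float              # presentation timestamp, seconds
    stale_for: float = 0.0  # approx. seconds since the last real frame; 0 = live


class StaleFrameReader:
    """Never returns EOF: on timeout it re-emits the last frame marked stale."""

    def __init__(self, source: "queue.Queue[Frame]", timeout: float = 0.2):
        self.source = source
        self.timeout = timeout
        self.last: Optional[Frame] = None
        self.stale_for = 0.0

    def read(self) -> Optional[Frame]:
        try:
            frame = self.source.get(timeout=self.timeout)
        except queue.Empty:
            if self.last is None:
                return None  # nothing received yet; caller draws a placeholder
            # Approximate the age by summing the timeouts that have elapsed.
            self.stale_for += self.timeout
            return Frame(self.last.data, self.last.pts, self.stale_for)
        self.last = frame
        self.stale_for = 0.0
        return frame
```

A consumer would call `read()` in its frame loop and use `stale_for` to decide whether to draw the "stale Ns ago" watermark.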

Similarly for the audio sidecar: amix with lavfi anullsrc as an always-running fallback input. If the real radio dies, the sidecar falls back to silence and does not crash.

Controller monitors stream states (via MQTT events from filter side data) → toggles icon overlays.

Resilience: Maximum; the output never EOFs
Placeholder logic: the filter renders the placeholder itself
GPU cost: 0× extra
Complexity: High (filter rework + cuframes demuxer change)
Dev cost: ~3-4 days (filter + cuframes + controller monitoring)

D. Hybrid (recommended)

Combine B (audio sidecar with lavfi fallback) + partial C (the filter knows about EOF and renders a placeholder, but no demuxer rework is required: it relies on the EXT_INFINITY framesync behaviour + Y/UV plane fill).

framesync already supports EXT_INFINITY, which repeats the last frame indefinitely. If an input dies, the filter keeps repeating its last frame. Not ideal (frozen video instead of "NO SIGNAL"), but the output continues.

Improvement: the filter detects "frame timestamp not advancing for > 5 sec" and switches to placeholder rendering (built-in NO SIGNAL banner over the frozen frame).
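The "timestamp not advancing" check is small enough to sketch. The real logic would live in the C filter; the Python below only illustrates the rule, and the `StaleDetector` name and its `update(pts, now)` interface are hypothetical. It assumes the filter sees the same PTS repeated while framesync is replaying the last frame.

```python
from typing import Optional


class StaleDetector:
    """Flags a pad as stale when its PTS has not advanced for `threshold` seconds."""

    def __init__(self, threshold: float = 5.0):
        self.threshold = threshold
        self.last_pts: Optional[float] = None
        self.last_advance: Optional[float] = None  # wall-clock time of last PTS change

    def update(self, pts: float, now: float) -> bool:
        """Feed the PTS seen at wall-clock time `now`; return True if the pad is stale."""
        if self.last_pts is None or pts > self.last_pts:
            self.last_pts = pts
            self.last_advance = now
            return False
        return (now - self.last_advance) > self.threshold
```

One detector per input pad would be enough: while `update()` returns True the cell gets the NO SIGNAL banner, and the first frame with a newer PTS clears it.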

Audio: the sidecar uses amix=duration=longest with anullsrc as an always-on input. If the real radio drops, amix continues with silence.

Resilience: Good; the output never EOFs, with degraded UX on stream loss
Placeholder logic: the filter detects stale frames and renders a banner
GPU cost: 0× extra
Complexity: Medium (additional filter logic + audio sidecar config)
Dev cost: ~1-2 days

Quick wins (applicable regardless of the option chosen)

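  1. Audio sidecar: lavfi anullsrc as an amix input (silence fallback)

    anullsrc=r=48000:cl=stereo[silence];
    [0:a][1:a][2:a]astreamselect=inputs=3:map=0[music];
    [music][3:a][silence]amix=inputs=3:duration=longest
    

    If the real audio disappears, the output carries silence and the pipeline does not EOF. The anullsrc sample rate and channel layout and the initial map=0 selection are placeholder values; match them to the real sidecar configuration.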

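  2. Pipeline ffmpeg input reconnect flags for the RTSP loopback from mediamtx:

    -rtsp_transport tcp -reconnect 1 -reconnect_streamed 1 -reconnect_delay_max 5
    -i rtsp://cuda-grid-mediamtx:8554/live-audio
    
    To verify: -reconnect, -reconnect_streamed and -reconnect_delay_max are documented as options of ffmpeg's HTTP protocol, so they may have no effect on an rtsp:// input. If that is the case, reconnection has to come from an external retry of the pipeline.
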
  3. Controller watchdog: periodically check mediamtx /v3/paths/list (or use ffprobe). If the live-audio path has been missing for > 10 sec, emit an MQTT event and dispatch the "audio offline" icon overlay.

  4. MQTT events for stream state:

    • cuda_grid/event/<inst>/stream_lost { cam: "gate_lpr" }
    • cuda_grid/event/<inst>/stream_restored { cam: "gate_lpr" }
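Quick wins 3 and 4 can be sketched together as a debounced watchdog that turns poll results into MQTT topic/payload pairs. The controller's language is not stated in this issue, so Python is used only for illustration. The `PathWatchdog` class and the `tv1` instance name in the usage note are hypothetical. The sketch also assumes the `/v3/paths/list` response carries an `items` array whose entries have `name` and `ready` fields, which should be checked against the deployed mediamtx version. Fetching the JSON and publishing the event are left to the caller.

```python
import json
from typing import Optional, Tuple

Event = Tuple[str, str]  # (MQTT topic, JSON payload)


def ready_paths(paths_list: dict) -> set:
    """Names of paths reported ready in a parsed /v3/paths/list response body."""
    return {p["name"] for p in paths_list.get("items", []) if p.get("ready")}


class PathWatchdog:
    """Emits stream_lost after `grace` seconds of absence, stream_restored on return."""

    def __init__(self, inst: str, path: str, grace: float = 10.0):
        self.inst, self.path, self.grace = inst, path, grace
        self.missing_since: Optional[float] = None
        self.lost = False

    def _event(self, kind: str) -> Event:
        # Topic layout and the "cam" key follow the event examples in this issue.
        topic = f"cuda_grid/event/{self.inst}/{kind}"
        return topic, json.dumps({"cam": self.path})

    def poll(self, paths_list: dict, now: float) -> Optional[Event]:
        if self.path in ready_paths(paths_list):
            self.missing_since = None
            if self.lost:
                self.lost = False
                return self._event("stream_restored")
            return None
        if self.missing_since is None:
            self.missing_since = now
        if not self.lost and now - self.missing_since > self.grace:
            self.lost = True
            return self._event("stream_lost")
        return None
```

A controller loop would call `PathWatchdog("tv1", "live-audio").poll(response_json, time.monotonic())` every second or so and publish whatever event comes back; the grace period keeps short mediamtx hiccups from producing lost/restored flapping.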

Recommendation

Phase 1 (immediate, no filter dev): quick wins 1-4, implementable on the controller + audio sidecar alone. Covers an estimated ~80% of production failures (network glitches, radio drops, mediamtx hiccups).

Phase 2 (medium-term): Option D, where the filter detects stale frames and renders a placeholder. Requires filter dev in parallel with the #2 layout redesign.

Phase 3 (long-term, before production): full Option C, a demuxer rework for true resilience.

Related

  • #2 (production GPU efficiency): the filter rework there overlaps with this one. If #2 goes with its Option 3, add the resilience features there as well.
  • Today's incident: the audio sidecar's RTSP output lost its connection after ~3h; the pipeline tried to reconnect to live-audio but got 404 and looped. A manual restart fixed it.
Reference: gx/vf-cuda-grid#3