mtmd adds video input support in llama.cpp

Original: mtmd : add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp

llama.cpp merged mtmd video input support for CLI, chat completions, and web UI multimodal use.

ggml-org/llama.cpp merged PR #24269, adding video input support to mtmd through mtmd-cli and /chat/completions, which also enables the web UI path. The implementation invokes a locally installed ffmpeg as a subprocess instead of bundling codec support. It currently extracts visual frames only; audio is not supported yet. It was tested with Qwen3-VL-2B in the CLI and Gemma 4 E4B in the web UI, which makes local multimodal video experiments more accessible.
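To illustrate the general idea of delegating decoding to a locally installed ffmpeg, here is a minimal Python sketch that samples frames from a video as PNG files. It is not the code from the PR: mtmd is written in C++, and the flags it passes to ffmpeg are not described in this summary. The function names, the output file pattern, and the default sampling rate of one frame per second are all assumptions made for illustration. Only standard ffmpeg options (`-i`, `-an`, `-vf fps=...`) are used.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, out_dir, fps=1.0):
    """Build an ffmpeg argv list that samples `fps` frames per second as PNGs.

    Hypothetical helper for illustration; not taken from llama.cpp.
    """
    pattern = str(Path(out_dir) / "frame_%05d.png")
    return [
        "ffmpeg",
        "-hide_banner",
        "-loglevel", "error",
        "-i", str(video_path),
        "-an",                # ignore audio, keep visual frames only
        "-vf", f"fps={fps}",  # assumed sampling rate, not the PR's value
        pattern,
    ]


def extract_frames(video_path, out_dir, fps=1.0):
    """Run ffmpeg as a subprocess and return the extracted frame paths."""
    if shutil.which("ffmpeg") is None:
        # No codecs are bundled, so a local ffmpeg install is required.
        raise RuntimeError("ffmpeg not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, out_dir, fps), check=True)
    return sorted(Path(out_dir).glob("frame_*.png"))
```

Calling `extract_frames("clip.mp4", "frames")` would write numbered PNG files into `frames/`, which could then be passed to a vision model as ordinary images. Shelling out this way keeps the host program free of codec dependencies, at the cost of requiring ffmpeg on the user's machine.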

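ggml-org/llama.cpp merged PR #24269 on June 8, 2026, adding video input support to mtmd. llama.cpp's multimodal processing path is therefore no longer limited to images or other single media files: a video can now serve as the input source, so models with visual understanding, such as the Gemma and Qwen models mentioned in the Reddit post, can start answering questions about video content. The PR's stated goals list two entry points: mtmd-cli can take a video file directly, and /chat/completions also accepts video input, so the web UI benefits automatically.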


Summaries are AI-generated; the original article is authoritative.