mtmd adds video input support in llama.cpp
Original: mtmd : add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp
llama.cpp merged mtmd video input support for CLI, chat completions, and web UI multimodal use.
ggml-org/llama.cpp merged PR #24269, adding video input support to mtmd through mtmd-cli and /chat/completions, which also enables the web UI path. The implementation invokes a locally installed ffmpeg subprocess instead of bundling codec support. It currently extracts visual frames only; audio is not supported yet. It was tested with Qwen3-VL-2B in the CLI and Gemma 4 E4B in the web UI, which makes local multimodal video experiments more accessible.
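ggml-org/llama.cpp merged PR #24269 on June 8, 2026, adding video input support to mtmd. This means llama.cpp's multimodal processing path is no longer limited to images or other single media files: a video can now serve as the input source, so models with visual understanding, such as the Gemma and Qwen models mentioned in the Reddit post, can start answering questions about video content. The PR's goals explicitly list two entry points: mtmd-cli can take a video file directly, and /chat/completions can also accept video input, so the web UI benefits automatically.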
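The summary says frames are extracted by calling a locally installed ffmpeg as a subprocess, without audio. The sketch below illustrates that general pattern in Python; it is not the PR's code. The function names, the output file pattern, and the one-frame-per-second sampling rate are assumptions made for illustration, and the summary does not state which ffmpeg arguments mtmd actually passes.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_frame_cmd(video_path, out_dir, fps=1.0, ffmpeg_bin="ffmpeg"):
    """Build an ffmpeg argument list that samples `fps` frames per second as PNGs.

    Hypothetical helper: the sampling rate and file pattern are illustrative,
    not taken from the PR.
    """
    pattern = str(Path(out_dir) / "frame_%05d.png")
    return [
        ffmpeg_bin,
        "-hide_banner",
        "-loglevel", "error",
        "-i", str(video_path),
        "-an",                # ignore audio streams; only frames are extracted
        "-vf", f"fps={fps}",  # resample the video to a fixed frame rate
        pattern,
    ]


def extract_frames(video_path, out_dir, fps=1.0):
    """Run the locally installed ffmpeg and return the extracted frame paths."""
    ffmpeg_bin = shutil.which("ffmpeg")
    if ffmpeg_bin is None:
        # Nothing is bundled, so a missing ffmpeg has to be reported to the user.
        raise RuntimeError("ffmpeg not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cmd = build_ffmpeg_frame_cmd(video_path, out_dir, fps, ffmpeg_bin)
    subprocess.run(cmd, check=True)
    return sorted(Path(out_dir).glob("frame_*.png"))
```

Calling `extract_frames("clip.mp4", "frames")` would then yield a sorted list of image paths that could be passed to a vision model one by one. Relying on the system ffmpeg keeps codec support out of the project itself, at the cost of requiring users to install it separately.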