Latest in AI

Showing: multimodal, Developers

Mistral AI Launches Voxtral: Audio Speech and Understanding Model
Mistral AI News | 40 days ago | Release
Mistral AI has announced Voxtral, its debut audio-native language model family targeting speech recognition, multilingual transcription, and audio comprehension. Available in two sizes via Mistral's La Plateforme API, it extends the company's portfolio decisively into multimodal AI. The release positions Mistral as a full-stack AI provider capable of handling voice and audio alongside its established text and code capabilities.
MolmoMotion: Language-Guided 3D Motion Forecasting
Hugging Face Blog | 40 days ago | Paper
Allen Institute for AI has released MolmoMotion, a new model that adds language-guided 3D motion forecasting to the open-source Molmo family. By conditioning spatial trajectory predictions on natural language, the system enables more flexible, human-interpretable motion anticipation. The work targets applications in robotics, video understanding, and embodied AI where predicting movement in 3D space is safety-critical or operationally essential.
NVIDIA Releases NVFP4-Quantized DiffusionGemma 26B A4B IT on Hugging Face
r/LocalLLaMA top day | 47 days ago | Release
NVIDIA has released DiffusionGemma 26B A4B IT NVFP4 on Hugging Face, a quantized version of Google DeepMind's open-weights multimodal model. Built on a Mixture-of-Experts architecture with 25.2B total but only 3.8B active parameters, it generates text in parallel 256-token blocks using discrete diffusion, exceeding 1,100 tokens per second on H100 hardware. The model supports a 256K-token context, text/image/video inputs, native function calling, reasoning mode, and 35+ languages.
Lemonade v10.7 Adds Omni Models, Benchmarks, and Cross-Vendor GPU Support
r/LocalLLaMA top day | 47 days ago | Release
Lemonade v10.7 marks a project-level shift toward working-group-driven development, with 19 contributors involved in the release. The update improves LMX-Omni virtual models for Open WebUI and OpenAI-compatible multimedia clients, introduces the `lemonade bench` CLI, and expands backend support. CUDA, Vulkan, llama.cpp, stable-diffusion.cpp, FastFlowLM, and vLLM are part of the broader push toward cross-vendor local AI performance.
SenseNova U1 Adds an Infographic-Specific Fine-Tune
r/LocalLLaMA top day | 47 days ago | Release
A Reddit post highlights a new infographic-specific fine-tune for SenseNova U1-8B-MoT, trained with an extended multi-task phase for structured visual output. The reported benchmarks show large gains in IGenBench infographic accuracy and chart understanding, with a smaller improvement in text rendering. The aesthetic score appears roughly unchanged, suggesting the update mainly improves information structure and visual reasoning rather than overall visual polish.
Gemma 4 12B Unified Audio Loses Speech Attention with Large System Prompts
r/LocalLLaMA top day | 48 days ago | Commentary
A developer building a single-pass voice assistant with Gemma 4 12B unified (an encoder-free audio/vision/text model) reports that audio attention collapses once the system prompt grows to ~21k tokens. The model then ignores the spoken input or hallucinates a response. The issue reproduces identically on vLLM, llama.cpp, and LiteRT-LM, which points to an architectural attention-saturation limit rather than a stack-specific bug.
Exif Smuggling: PoC for Hiding Malicious Prompts in Image EXIF Metadata
Hacker News (AI keywords) | 48 days ago | Incident
Exif Smuggling is a security PoC showing how attackers can embed hidden instructions in image EXIF metadata fields to perform indirect prompt injection against vision-capable AI models. When AI systems parse images alongside their metadata, embedded malicious text may be processed as legitimate instructions, bypassing standard input filters. Developers building AI apps with image upload features should strip or sanitize EXIF data before passing content to language models.
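The stripping step recommended above can be sketched for JPEG uploads as follows. This is a minimal illustration using only the Python standard library, not the PoC's own code: the function name `strip_exif_from_jpeg` is hypothetical, it assumes a well-formed JPEG without fill bytes between segments, and it removes only APP1 segments that start with the `Exif` header. Other metadata carriers (XMP, JPEG comment segments, PNG text chunks) would need their own handling.

```python
import struct

SOI = b"\xff\xd8"            # JPEG start-of-image marker
EXIF_HEADER = b"Exif\x00\x00"  # identifier at the start of an EXIF APP1 payload


def strip_exif_from_jpeg(data: bytes) -> bytes:
    """Return the JPEG bytes with EXIF-carrying APP1 segments removed."""
    if data[:2] != SOI:
        raise ValueError("not a JPEG stream (missing SOI marker)")
    out = bytearray(SOI)
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("malformed JPEG: expected a segment marker")
        marker = data[pos + 1]
        if marker == 0xDA:
            # Start of scan: compressed image data follows, copy the rest as-is.
            out += data[pos:]
            return bytes(out)
        # The 2-byte big-endian length counts itself but not the marker bytes.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos:pos + 2 + length]
        is_exif = marker == 0xE1 and segment[4:10] == EXIF_HEADER
        if not is_exif:
            out += segment
        pos += 2 + length
    out += data[pos:]
    return bytes(out)
```

A caller would run this on the uploaded bytes before they reach the model. Re-encoding the image with an imaging library is a common alternative that also discards metadata, at the cost of recompression.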
Google announces Gemini 3.5 Live Translate for instant voice-to-voice translation
Ars Technica AI | 48 days ago | New Tool
Google has announced Gemini 3.5 Live Translate, a real-time voice-to-voice translation system that preserves the original speaker's tone, pacing, and pitch rather than producing flat synthetic output. The system embeds Google's SynthID watermarks into translated audio, enabling AI content provenance detection without affecting audio quality. This extends Google's Gemini Live multimodal API capabilities into cross-language communication scenarios such as meetings, live streams, and customer service.
Google Introduces Gemma 4 12B: A Unified, Encoder-Free Multimodal Model ★ 85
Google DeepMind Blog | 48 days ago | Release
Google DeepMind has unveiled Gemma 4 12B, a next-generation open-weights model featuring a unified, encoder-free multimodal architecture. By eliminating the traditional separate vision encoder (such as ViT), it processes diverse modalities directly within a single Transformer network. This design simplifies training, reduces inference latency, and enhances cross-modal alignment, marking a significant milestone for open-weights AI.
mtmd adds video input support in llama.cpp ★ 72
r/LocalLLaMA top day | 49 days ago | Release
ggml-org/llama.cpp merged PR #24269, adding video input support to mtmd through mtmd-cli and /chat/completions, which also enables the web UI path. The implementation invokes a locally installed ffmpeg subprocess instead of bundling codec support, and currently extracts visual frames only, with no audio support yet. It was tested with Qwen3-VL-2B in CLI and Gemma 4 E4B in web UI, making local multimodal video experiments more accessible.
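The general approach of shelling out to a locally installed ffmpeg for visual frames can be sketched as below. This is not the command llama.cpp issues; the function names, the one-frame-per-second default, and the `frame_%05d.png` output pattern are assumptions chosen for illustration. The ffmpeg options used (`-i`, `-an`, `-vf fps=...`) are standard.

```python
import shutil
import subprocess
from pathlib import Path


def build_frame_extract_cmd(video: str, out_dir: str, fps: float = 1.0) -> list[str]:
    """Build an ffmpeg argument list that samples frames and drops audio."""
    pattern = str(Path(out_dir) / "frame_%05d.png")  # hypothetical naming scheme
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", video,
        "-an",                # ignore the audio stream: visual frames only
        "-vf", f"fps={fps}",  # resample to a fixed number of frames per second
        pattern,
    ]


def extract_frames(video: str, out_dir: str, fps: float = 1.0) -> list[Path]:
    """Run ffmpeg as a subprocess and return the extracted frame paths."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_frame_extract_cmd(video, out_dir, fps), check=True)
    return sorted(Path(out_dir).glob("frame_*.png"))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names, and checking for ffmpeg up front gives a clearer error than a failed subprocess call.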
Introducing Mistral 3 ★ 84
Mistral AI News | 50 days ago | Release
Mistral AI introduced Mistral 3, a new open model family under Apache 2.0. It includes Mistral Large 3, a 675B-parameter sparse MoE with 41B active parameters, plus Ministral 3 models at 3B, 8B, and 14B. The release targets frontier open-weight use, multimodal and multilingual workflows, enterprise customization, and efficient local or edge deployments.
Introducing Mistral Small 4 ★ 76
Mistral AI News | 50 days ago | Release
Mistral AI introduced Mistral Small 4 as the next major release in the Mistral Small family. It combines reasoning, multimodal, and agentic coding capabilities into one open model with configurable reasoning effort. The model uses a MoE architecture, supports a 256k context window and text-image inputs, and is available through Mistral API, AI Studio, Hugging Face, NVIDIA NIM, and common inference stacks.
Introducing Mistral 3 ★ 78
Mistral AI News | 50 days ago | Release
Mistral AI introduced Mistral 3, a new open model family including Mistral Large 3 and Ministral 3 models at 3B, 8B, and 14B sizes. Large 3 is a 675B-parameter sparse MoE model with 41B active parameters, while Ministral 3 targets local and edge use cases. The models are released under Apache 2.0 and are available through Mistral AI Studio, Hugging Face, Amazon Bedrock, and other platforms.
Introducing Mistral Small 4 ★ 78
Mistral AI News | 50 days ago | Release
Mistral Small 4 is the next major release in the Mistral Small family, unifying Magistral-style reasoning, Pixtral-style multimodality, and Devstral-style coding agents. It uses a MoE architecture with 119B total parameters, 6B active parameters per token, a 256k context window, and configurable reasoning effort. The model is available via Mistral API, AI Studio, Hugging Face, open-source serving stacks, and NVIDIA deployment options.
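To put the total and active parameter counts in proportion, the share of weights used per token can be computed directly. The helper below is a hypothetical illustration and only does the arithmetic on the figures quoted above; it says nothing about how routing is implemented in the model.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of a sparse MoE model's parameters that are active per token.

    Both arguments are in billions of parameters.
    """
    if total_params_b <= 0 or not 0 <= active_params_b <= total_params_b:
        raise ValueError("expected 0 <= active <= total and total > 0")
    return active_params_b / total_params_b


# Figures quoted in the summary: 119B total, 6B active per token.
share = active_fraction(119, 6)
print(f"{share:.1%} of parameters are active per token")
```

With these figures, roughly 5% of the weights take part in each token's forward pass, although all 119B still have to be stored somewhere to serve the model.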
Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI
Hugging Face Blog | 53 days ago | Release
NVIDIA’s Nemotron 3.5 Content Safety is positioned as a customizable multimodal safety layer for global enterprise AI. Based on the title, it appears focused on content moderation and policy enforcement across AI applications, potentially including text and visual contexts. Without the full article, details such as benchmarks, licensing, supported languages, deployment paths, and model specifications should not be assumed.
Google's Gemma 4 12B is designed to run on 16GB RAM laptops
Ars Technica AI | 54 days ago | Release
Google introduced Gemma 4 12B, an open model aimed at running locally on laptops with 16GB of RAM. The model uses a new encoding scheme and token prediction to improve efficiency relative to its size. Its practical importance depends on real-world benchmarks, but it could lower the barrier for private, offline, and local multimodal AI workflows.
Mysterious AI Startup Hark Closes $700 Million Series A to Build a "Universal" AI Interface and Dedicated Hardware ★ 75
TechCrunch AI | 67 days ago | Business
The mysterious AI startup Hark has announced the successful completion of a Series A funding round totaling $700 million (approximately NT$22 billion), capital…
Google I/O 2026 Major Announcements: Gemini 3.5 Flash, Omni (NanoBanana Video Model), Spark Background Agent, and Antigravity 2.0 ★ 85
Latent Space | 69 days ago | Release
In the latest issue of Latent Space AINews, the major announcements from Google I/O 2026 were covered in depth. Google demonstrated its formidable R&D and…
llm-gemini 0.32 Released: Command-Line Tool Now Supports the New Gemini 3.5 Flash Model
Simon Willison's Weblog | 69 days ago | Release
Well-known open-source developer Simon Willison has announced the release of version 0.32 of `llm-gemini`, the dedicated plugin for his command-line LLM tool…
Google DeepMind Unveils Gemini Omni: A New Natively Omnimodal Model for Ultra-Low-Latency Real-Time Audio, Video, and Voice Interaction ★ 95
Google DeepMind Blog | 71 days ago | Release
Google DeepMind has officially unveiled its latest flagship AI model, "Gemini Omni." This model represents a major breakthrough by Google in the field of…
Gemini for Science: Google DeepMind Launches New Scientific AI Tools and Experiments, Opening a New Era of Discovery ★ 85
Google DeepMind Blog | 71 days ago | Release
Google DeepMind has unveiled a new initiative called "Gemini for Science" — a collection of AI tools and experiments designed to expand the scale and precision…
NVIDIA Launches Nemotron 3 Nano Omni: A Long-Context Multimodal Model Designed for Document, Speech, and Video Agents ★ 75
Hugging Face Blog | 90 days ago | Release
NVIDIA has officially launched a new lightweight multimodal model, "Nemotron 3 Nano Omni." This model is designed to deliver powerful multimodal intelligence…
Training and Fine-Tuning Multimodal Embedding and Reranker Models with Sentence Transformers ★ 80
Hugging Face Blog | 103 days ago | Tutorial
As multimodal AI has become widespread, integrating data from different modalities — text, images, and more — into a single vector space and performing…
Sentence Transformers Adds Support for Multimodal Embedding and Reranker Models ★ 78
Hugging Face Blog | 110 days ago | Release
The popular open-source library `sentence-transformers` from Hugging Face has received a major update, officially introducing native support for Multimodal…
Google Unveils Gemma 4: Frontier Multimodal Open Models Designed for On-Device Use ★ 85
Hugging Face Blog | 117 days ago | Release
Google and Hugging Face have jointly announced a new generation of open-weight models — "Gemma 4." This model represents a major breakthrough in on-device AI…
TII Launches the New Falcon Perception Multimodal Perception Model ★ 75
Hugging Face Blog | 118 days ago | Release
The Technology Innovation Institute (TII) of the UAE has officially announced the launch of its new "Falcon Perception" model on the Hugging Face blog. As an…
Vercel AI Gateway Now Supports the GLM 5V Turbo Multimodal Model
Vercel Changelog | 118 days ago | Release
Vercel has announced in its product Changelog that the Vercel AI Gateway now officially supports the GLM 5V Turbo model. GLM 5V Turbo is a high-performance…
IBM Launches Granite 4.0 3B Vision: A Lightweight Multimodal AI Model Designed for Enterprise Documents ★ 75
Hugging Face Blog | 118 days ago | Release
IBM has officially launched its new lightweight multimodal model on Hugging Face — the Granite 4.0 3B Vision. With 3 billion (3B) parameters, this model is…
Hugging Face State of Open Source Report: Spring 2026 Edition ★ 85
Hugging Face Blog | 132 days ago | Commentary
Hugging Face has published its Spring 2026 "State of Open Source AI" report, offering a comprehensive review of the explosive growth and paradigm shifts that…
Holotron-12B: High-Throughput Computer Use AI Agent Model Released ★ 75
Hugging Face Blog | 133 days ago | Release
Hcompany has officially released a new model on Hugging Face called **Holotron-12B**, positioned as a "High Throughput Computer Use Agent." Although only the…
