Latest in AI


Benchmarking Google Eloquent Exposes Major On-Device Dictation Reliability Issues
r/LocalLLaMA top day · 47 days ago · Benchmark
A LocalLLaMA user tried to benchmark Google’s new fully local dictation app, Eloquent, against open ASR models such as Qwen3-ASR and NVIDIA Parakeet V3. The tester reported that roughly half of dictations returned only fragments, even during manual use. When Eloquent produced complete transcripts, its word error rate was competitive, but the missing-output behavior made the app unreliable for evaluation and practical use.
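To see why fragment-only output undermines a word error rate comparison, here is a minimal sketch of WER as word-level edit distance divided by reference length. The sentences are invented for illustration and are not from the post, and the tester's actual scoring setup (text normalization, tooling) is not described in the source.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences (not taken from the benchmark).
reference = "please schedule the meeting for nine tomorrow morning"
complete = "please schedule a meeting for nine tomorrow morning"
fragment = "please schedule"

print(word_error_rate(reference, complete))  # one substitution in eight words
print(word_error_rate(reference, fragment))  # six of eight words missing
```

Under this definition a transcript that stops after two of eight words scores 0.75 even though every word it returned is correct, so a benchmark has to either count such runs as failures or exclude them, and either choice changes the average.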
DiffusionGemma: Google Launches High-Speed Open-Weight Gemma Diffusion Model ★ 76
Simon Willison's Weblog · 47 days ago · Release
Simon Willison highlights Google’s new DiffusionGemma, an Apache 2 licensed open-weight Gemma model. He connects it to last year’s brief Gemini Diffusion preview, which he measured at 857 tokens per second. NVIDIA is currently hosting the model for free on its NIM cloud API, where Willison generated 2,409 tokens in 4.4 seconds, implying at least 500 tokens per second.
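The implied rate follows directly from the two figures quoted above. The short check below divides tokens by elapsed seconds; treating the 4.4 seconds as end-to-end wall time that may include request overhead is an assumption, though it would be consistent with the "at least" wording.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Average generation rate over one request."""
    return tokens / seconds


# Figures quoted in the summary above.
nim_rate = tokens_per_second(2409, 4.4)
gemini_diffusion_rate = 857.0

print(f"{nim_rate:.1f} tokens/s on the hosted endpoint")
print(f"{nim_rate / gemini_diffusion_rate:.2f}x the earlier Gemini Diffusion measurement")
```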
Google DeepMind Releases DiffusionGemma: Open Source Model with 4x Local AI Execution Speed Improvement
Ars Technica AI · 47 days ago · Release
Google DeepMind has released DiffusionGemma, an open-source model that brings diffusion-based generation to text tasks. Unlike autoregressive LLMs that generate one token at a time, diffusion models can produce outputs in parallel, dramatically cutting latency. The result is reportedly a 4x speed improvement for local AI inference, making on-device deployment significantly more practical.
LocalLLaMA User Weighs QAT Gemma 31B GGUF Quants for RTX 3060
r/LocalLLaMA top day · 47 days ago · Commentary
A Reddit user with an RTX 3060 12GB and 32GB DDR3 RAM is evaluating new QAT-based Gemma 31B GGUF quantizations. They currently run an older Unsloth Gemma 31B IQ3_XXS build at long context, with some tensor and mmproj offloading to CPU. The post asks which Q2-Q3 quant to choose, whether QAT changes quality expectations, and whether MTP would help or hurt under tight VRAM limits.
πfs: the data-free filesystem that “stores” data in π
Hacker News (AI keywords) · 47 days ago · New Tool
πfs is an open-source FUSE-style filesystem built around a deliberately absurd idea: data does not need to be stored if it can be located in pi. It records metadata such as file names and positions in pi, then reconstructs content from those locations. The project is more technical humor and conceptual demonstration than practical storage or AI tooling.
llama.cpp Merges MTP Optimization Removing Padding and Extra D2D Copies
r/LocalLLaMA top day · 47 days ago · Release
llama.cpp merged PR #24086, which changes ggml_gated_delta_net so MTP passes snapshot count K as an operation parameter instead of deriving it from tensor shape. The change removes a padding workaround and copies emitted snapshots into the recurrent cache with a single strided ggml_cpy. Benchmarks on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf showed about a 4% throughput gain, with wall time falling from 21.71s to 20.91s.
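As a quick check of the "about 4%" figure, the sketch below converts the two wall times into a throughput gain and a time saving. It assumes both runs processed the same amount of work, which the summary implies but does not state.

```python
def throughput_gain_percent(before_s: float, after_s: float) -> float:
    """Gain in work per second when the same job takes after_s instead of before_s."""
    return (before_s / after_s - 1.0) * 100.0


def time_saved_percent(before_s: float, after_s: float) -> float:
    """Share of the original wall time that was eliminated."""
    return (before_s - after_s) / before_s * 100.0


# Wall times quoted in the summary above.
gain = throughput_gain_percent(21.71, 20.91)
saved = time_saved_percent(21.71, 20.91)
print(f"throughput gain: {gain:.2f}%")
print(f"wall time saved: {saved:.2f}%")
```

The two percentages differ slightly because one is measured against the new time and the other against the old one; both round to about 4%.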
DiffusionGemma: 4x faster text generation ★ 74
Google DeepMind Blog · 47 days ago · Release
Google’s DiffusionGemma is an Apache 2.0 experimental open model using text diffusion instead of standard autoregressive decoding. The 26B MoE model activates 3.8B parameters during inference and is designed for low-latency local workflows. Google claims up to 4x faster generation on dedicated GPUs, while noting that output quality is below standard Gemma 4 and production-quality use cases should still prefer Gemma 4.
Lemonade v10.7 Adds Omni Models, Benchmarks, and Cross-Vendor GPU Support
r/LocalLLaMA top day · 47 days ago · Release
Lemonade v10.7 marks a project-level shift toward working-group-driven development, with 19 contributors involved in the release. The update improves LMX-Omni virtual models for Open WebUI and OpenAI-compatible multimedia clients, introduces the `lemonade bench` CLI, and expands backend support. CUDA, Vulkan, llama.cpp, stable-diffusion.cpp, FastFlowLM, and vLLM are part of the broader push toward cross-vendor local AI performance.
NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI
NVIDIA Blog · 47 days ago · Release
Google DeepMind released DiffusionGemma, an experimental open model built for fast text generation. NVIDIA says it optimized the model for GeForce RTX GPUs, RTX PRO platforms, and DGX Spark systems. Instead of generating text one word at a time, DiffusionGemma produces multiple words in parallel to reduce latency for single-user workloads.
DiffusionGemma: The Developer Guide — Google Developers Blog
r/LocalLLaMA top day · 47 days ago · Tutorial
Google has released a comprehensive developer guide for DiffusionGemma, a text-generation model that uses masked diffusion rather than autoregressive next-token prediction. Unlike standard Gemma models, DiffusionGemma iteratively denoises a fully masked sequence to produce output, enabling a fundamentally different generation paradigm. The guide targets developers looking to integrate or experiment with diffusion-based LLMs using Google's tooling.
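The general shape of masked-diffusion decoding can be illustrated with a toy loop: start from a fully masked sequence, propose tokens for all masked positions at once, commit the most confident proposals, and repeat. This is a sketch under heavy assumptions and not DiffusionGemma's algorithm or API: `toy_denoiser` is a hypothetical stand-in that already knows the target text and assigns random confidences, and the even unmasking schedule is invented for illustration.

```python
import random

MASK = "<mask>"


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a model: for every masked position,
    propose a token and a confidence score (here: the known target
    token and a random confidence)."""
    return {i: (target[i], rng.random())
            for i, tok in enumerate(tokens) if tok == MASK}


def generate(target, steps=4, seed=0):
    """Iteratively unmask a fully masked sequence in a fixed number of steps."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]
    for step in range(steps):
        proposals = toy_denoiser(tokens, target, rng)
        if not proposals:
            break
        # Commit an even share of the remaining masked positions (rounded up).
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)
        most_confident = sorted(proposals, key=lambda i: proposals[i][1],
                                reverse=True)[:k]
        for i in most_confident:
            tokens[i] = proposals[i][0]
        history.append(list(tokens))
    return tokens, history


target = "the quick brown fox jumps over the lazy".split()
tokens, history = generate(target, steps=4)
for snapshot in history:
    print(" ".join(snapshot))
```

Each pass fills several positions at once and in no fixed left-to-right order, which is the contrast with next-token prediction that the summary describes.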
DiffusionGemma: 4x Faster Text Generation ★ 76
Hacker News (AI keywords) · 47 days ago · Release
Google released DiffusionGemma, a 26B MoE experimental open model using text diffusion instead of token-by-token autoregressive decoding. It can generate blocks of text in parallel, reaching up to 4x faster output on dedicated GPUs. The model targets local, speed-sensitive workflows, but Google says its output quality is below standard Gemma 4 and recommends Gemma 4 for quality-critical production use.
SenseNova U1 Adds an Infographic-Specific Fine-Tune
r/LocalLLaMA top day · 47 days ago · Release
A Reddit post highlights a new infographic-specific fine-tune for SenseNova U1-8B-MoT, trained with an extended multi-task phase for structured visual output. The reported benchmarks show large gains in IGenBench infographic accuracy and chart understanding, with smaller improvement in text rendering. Aesthetic score appears roughly unchanged, suggesting the update mainly improves information structure and visual reasoning rather than overall visual polish.
Bonsai LM 1-bit and 1.58-bit Benchmarks on Jetson Orin Nano Super
r/LocalLLaMA top day · 48 days ago · Benchmark
A LocalLLaMA post benchmarks five Bonsai LM models, from 1.7B to about 8B parameters, on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA. The tests compare 7W, 15W, 25W, and MAXN modes across latency, throughput, energy per token, and thermals. The main takeaway is that 25W is usually the best efficiency/performance point for models up to 4B, while Bonsai-8B may favor 15W for lower power.
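Energy per token is average power divided by throughput. A minimal sketch of that calculation follows; the power and speed values are hypothetical placeholders rather than figures from the post, and a power mode's nominal cap is not the same as measured draw.

```python
def joules_per_token(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy cost of one generated token: watts / (tokens per second)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return avg_power_watts / tokens_per_second


# Hypothetical (measured watts, tokens/s) pairs, NOT taken from the post.
runs = {"15W mode": (12.0, 16.0), "25W mode": (20.0, 32.0)}
for mode, (watts, tps) in runs.items():
    print(f"{mode}: {joules_per_token(watts, tps):.3f} J/token")
```

A higher power mode can still cost less energy per token whenever throughput rises faster than power draw, which is one way a mid-range mode can come out as the efficiency sweet spot.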
MooreThreads Releases MusaCoder-27B Code LLM on Hugging Face
r/LocalLLaMA top day · 48 days ago · Release
MooreThreads, a Chinese GPU semiconductor company best known for its MUSA compute platform, has released MusaCoder-27B on Hugging Face alongside a technical paper on arXiv. The 27B-parameter model is positioned as a code-generation LLM, extending MooreThreads' ambitions beyond hardware into the AI model layer. Its public availability on Hugging Face signals an open-weights approach, making it accessible to local-inference practitioners and researchers evaluating alternatives to Western-origin coding models.
OpenLumara Creator Challenges Reddit to Hack Its Public Agent Instance
r/LocalLLaMA top day · 48 days ago · Incident
The creator of OpenLumara posted a public challenge asking r/LocalLLaMA users to try breaking into a Discord-hosted instance of the local-model agent. They claimed common prompt-engineering attacks would not work because modules and sandboxes were heavily locked down. The post later listed several successful findings, including missing path traversal protection, an authorization-check bypass, and another undisclosed exploit pending a fix.
Qwen3.6-MTP-27B on Tesla V100: llama.cpp Throughput Tuning Question
r/LocalLLaMA top day · 48 days ago · Benchmark
A Reddit user is running Qwen3.6-MTP-27B-MTP in Q4_K_M GGUF format with llama.cpp server on a 32GB Tesla V100. They report one peak of 55 tokens per second, but typical throughput is closer to 44-48 TPS. The post asks whether flags such as parallelism, speculative MTP draft settings, KV cache quantization, flash attention, and a 262K context window are limiting performance without improving output quality.
How Useful Is qwopus Compared With Qwen3.6 27B for Coding?
r/LocalLLaMA top day · 48 days ago · Opinion
A Reddit user on r/LocalLLaMA asks for practical comparisons between qwopus and Qwen3.6 27B, specifically for coding work. They note conflicting community opinions, with some users calling qwopus worse and others saying it is much better. In their own simple tests, they did not notice clear differences and want feedback from people using these models for agentic coding.
Charting Local LLM Releases: 2025 Was the Peak, Not 2026
r/LocalLLaMA top day · 48 days ago · Commentary
An r/LocalLLaMA community member shared visualizations tracking the volume of local LLM releases over time. Contrary to the perception that 2026 has been an unusually prolific year, the data indicates the actual release peak occurred in 2025. The poster attributes the misperception to the outsized quality improvements in 2026, which made the year feel more eventful than the release counts show.
Gemma 4 12B Unified Audio Loses Speech Attention with Large System Prompts
r/LocalLLaMA top day · 48 days ago · Commentary
A developer building a single-pass voice assistant with Gemma 4 12B unified (an encoder-free audio/vision/text model) finds that audio attention collapses once the system prompt grows to ~21k tokens. The model then ignores the spoken input or hallucinates instead of responding to it. The issue reproduces identically on vLLM, llama.cpp, and LiteRT-LM, pointing to an architectural attention-saturation limit rather than a stack-specific bug.
Without Open Source LLMs, US AI Companies Could Have Monopolized the Technology
r/LocalLLaMA top day · 48 days ago · Opinion
This r/LocalLLaMA post argues that open-source LLMs are an ethical duty because AI has broad social impact. The author worries that without open models, US AI companies could have monopolized access and potentially limited availability to US firms. They also frame China’s release of powerful open-source LLMs as a contribution to humanity, despite political disagreements.
Anthropic Is Accused of Nerfing Fable for Other LLM Development
r/LocalLLaMA top day · 48 days ago · Commentary
An r/LocalLLaMA post claims Anthropic may be intentionally limiting Fable when users ask it to help build other LLMs. The source is a short Reddit post with screenshot context, not a formal benchmark or verified disclosure. Discussion centers on trust in hosted closed models, unclear safety boundaries, and why local or open-weight LLMs may be necessary for serious AI development work.
Unsloth releases GGUF version of Cohere North-Mini-Code 1.0 (30B A3B MoE) on Hugging Face
r/LocalLLaMA top day · 48 days ago · Release
Unsloth uploaded a GGUF version of Cohere's North-Mini-Code 1.0 to Hugging Face, making local inference possible for this 30B A3B MoE coding-focused model. The poster links the release to llama.cpp PR #24260, suggesting new architecture support may be required. No benchmarks or test results have been shared yet; this is an early community resource post.
Anthropic Claude Fable 5: Mythos-Class Power with Controversial Terms ★ 84
Latent Space · 48 days ago · Release
Anthropic released Claude Fable 5 as its first broadly available Mythos-class model, alongside restricted Mythos 5 access. Benchmarks and ecosystem reports show strong gains in coding, long-horizon agentic tasks, research, and vision. The controversy centers on 30-day retention for Mythos-class traffic and silent interventions that may reduce effectiveness on frontier LLM development tasks, raising trust, reproducibility, and open AI concerns.
Without open LLM competition, closed-source LLM companies will become insatiable
r/LocalLLaMA top day · 48 days ago · Opinion
An r/LocalLLaMA user criticizes closed-source LLM providers, singling out Anthropic and its $200/month users. The post argues that without open-source model competition, proprietary AI companies could become more arrogant and less accountable to customers. The source offers little concrete context beyond an image and opinionated commentary, so it is best read as a community sentiment post rather than a verified product incident.
Releasing Apodex-1.0 Smol Models (0.8B, 2B, 4B Open-Weights) Optimized for Agentic Verification + AgentHarness Evals
r/LocalLLaMA top day · 48 days ago · Release
Apodex 1.0 launches with open-weight models at 0.8B, 2B, and 4B, trained not for general generation but for specialized sub-agent roles—fact-checking external claims and verifying tool call outputs before passing results to a main controller. The design targets long-horizon agent workflows where routing small tasks to lightweight models avoids wasteful use of 70B+ models at every step. AgentHarness, an open-source evaluation framework for local multi-step agent pipelines, is released alongside the weights.
Furiosa AI inference chip could be a game changer for local LLMs
r/LocalLLaMA top day · 48 days ago · Hardware
An r/LocalLLaMA post discusses Furiosa AI’s RNGD inference chip, citing TSMC 5nm, Hynix HBM3, 48GB VRAM, 1.5TB/s bandwidth, and 180W TDP. The author argues it could matter for local LLM users if Furiosa opens its programming interface and works with llama.cpp on a GGML backend. The post later clarifies Furiosa is not selling to consumers; this is a wish and market commentary, not a launch.
Can Voice Agents Handle Bilingual Customers? Benchmarking Frontier ASR on Code-Switched Speech
Hugging Face Blog · 48 days ago · Benchmark
Code-switching—where bilingual speakers blend two languages in a single utterance—is common in markets like Taiwan, Singapore, and India, yet most ASR benchmarks focus on monolingual audio. ServiceNow AI evaluates frontier speech recognition models specifically on this mixed-language scenario. The findings help enterprise teams make informed ASR model choices when deploying voice agents for multilingual customer-facing applications.
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
r/LocalLLaMA top day · 48 days ago · Paper
OSCAR applies offline-precomputed rotation matrices—derived from spectral covariance analysis—to reshape KV tensor distributions before 2-bit quantization, suppressing outliers and reducing rounding error. The rotation adds negligible inference overhead since it requires no runtime learning. GGUF downloads for Gemma-4-12B-it, Qwen3-32B, and Qwen3-4B-Thinking are available, with llama.cpp and sglang integrations and an arXiv paper.
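The rotate-then-quantize idea can be shown on a four-element toy vector. The sketch below is not OSCAR's method: it swaps in a fixed normalized Hadamard matrix for the covariance-derived rotation the summary describes, uses plain min-max 2-bit quantization, and the input vector is made up to contain one outlier.

```python
def hadamard4(v):
    """Multiply a length-4 vector by the normalized 4x4 Hadamard matrix.
    The matrix is orthogonal and its own inverse."""
    a, b, c, d = v
    return [0.5 * (a + b + c + d),
            0.5 * (a - b + c - d),
            0.5 * (a + b - c - d),
            0.5 * (a - b - c + d)]


def quantize_2bit(v):
    """Uniform min-max quantization to 4 levels, returned dequantized."""
    lo, hi = min(v), max(v)
    if hi == lo:
        return list(v)
    step = (hi - lo) / 3  # 2 bits -> 4 levels -> 3 intervals
    return [lo + round((x - lo) / step) * step for x in v]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vector with one large outlier channel.
x = [8.0, 0.1, -0.2, 0.1]

direct = quantize_2bit(x)
# Rotate, quantize in the rotated basis, then rotate back.
via_rotation = hadamard4(quantize_2bit(hadamard4(x)))

err_direct = squared_error(x, direct)
err_rotated = squared_error(x, via_rotation)
print(f"direct 2-bit error:       {err_direct:.4f}")
print(f"rotate-quantize-unrotate: {err_rotated:.4f}")
```

The rotation spreads the outlier's energy across all four coordinates, so the quantization grid covers a much narrower range. Because the matrix is fixed ahead of time, nothing has to be learned at inference time, which matches the low-overhead claim in the summary.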
SCAIL-2: Open-Source End-to-End Character Animation Without Intermediate Pose Representations
r/LocalLLaMA top day · 48 days ago · Release
SCAIL-2 by zai-org removes the reliance on skeleton maps and inpainting masks common in prior character animation pipelines, driving characters directly from video in an end-to-end manner. Trained on 60K synthesized motion pairs using SCAIL-Preview, Wan-Animate, and MoCha via a Unified Motion Transfer Interface with RoPE design, the model develops emergent abilities beyond its teacher models. Capabilities include cross-identity character replacement, animal-driving scenarios, and zero-shot support for SAM3D-Body mesh rendering.
Releasing Cohere North Mini Code
r/LocalLLaMA top day · 48 days ago · Release
Cohere’s Jay Alammar announced the official release of North Mini Code after early community feedback from r/LocalLLaMA. Weights are available on Hugging Face, including an fp8 version, and the model can be tried for free through OpenCode. For vLLM deployment, Cohere recommends using vLLM main for now and installing cohere_melody for accurate response parsing, while noting community requests for quantization and llama.cpp support.

