Latest in AI

Showing:ResearchersOpen-sourceClear ×

← Home

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

"Fully Hallucinated Operating System" Simulates an Entire OS via LLM Prompts
r/LocalLLaMA top day50 days agoCommentary
A popular Reddit post highlights a video demonstrating a "Fully Hallucinated Operating System" run entirely inside an LLM. By prompting the model to act as a terminal, it simulates file systems, network requests, and command execution purely through text generation. While impractical for production, this experiment showcases the impressive state-tracking and "world model" capabilities of modern LLMs.
Exploring 2-bit QAT: Can Ultra-Compressed Large Models Outperform 4-bit Models Half Their Size?
r/LocalLLaMA top day50 days agoCommentary
A popular Reddit thread on r/LocalLLaMA discusses the potential of 2-bit Quantization Aware Training (QAT) for large MoE models (120B to 400B). While current QAT efforts focus on 4-bit, users speculate whether a 2-bit QAT model could fit into consumer hardware (64GB/128GB RAM) and outperform a 4-bit model of half its size. This approach is proposed as a practical alternative to training ternary (1.58-bit) LLMs from scratch.
Gemma-4-26B-A4B QAT Variant Performs Poorly in llama.cpp Compared to Non-QAT Version
r/LocalLLaMA top day51 days agoBenchmark
A LocalLLaMA user highlighted that the newly released QAT (Quantization-Aware Training) variant of Google's Gemma-4-26B-A4B model underperforms compared to its non-QAT predecessor. Testing via llama.cpp on a chessboard SVG generation task showed significant rendering errors in the QAT version. The non-QAT GGUF version, however, produced highly accurate results under identical settings.
Office-open-xml-viewer: Office XML document viewer rendering to HTML Canvas
Hacker News (AI keywords)51 days agoNew Tool
office-open-xml-viewer is an open-source browser viewer for Office Open XML documents, rendering DOCX, XLSX, and PPTX files to HTML Canvas. Its parsers are written in Rust and compiled to WebAssembly, while rendering uses the Canvas 2D API. The README also says the full codebase was implemented by Claude through iterative prompting, making it notable as an AI-assisted software development case.
Control 3D Avatars with Natural Language Using "Program as Weights" (programasweights)
r/LocalLLaMA top day51 days agoNew Tool
Developer Yuntian Deng introduced "programasweights," a framework that compiles plain-English descriptions into tiny, local action programs (loops, parallel tracks) to control 3D avatars. Instead of pre-defined buttons, users can command complex sequences like "wave while walking, then jump." The runtime code is open-source and runs entirely offline in the browser or via Python.
llama.cpp Gemma4 MTP Support Merged
r/LocalLLaMA top day51 days agoRelease
llama.cpp PR #23398 was merged on June 7, 2026, adding MTP support for Gemma4 models. The author reports over 2x average speedup on dense models, no observed speedup on MoE, and replicated AIME-26 results around 87%. Support currently covers 31B and 26B-4B variants, while E4B and E2B are not supported yet; multi-GPU may need extra draft-device configuration.
Clustering 3x Jetson Nano Orin Supers for Distributed AI
r/LocalLLaMA top day51 days agoTutorial
A developer has shared a practical guide on clustering three NVIDIA Jetson Nano Orin Super boards, leveraging their Ampere CUDA cores and unified memory. This project is part of 'smolcluster,' an initiative to make distributed AI training and inference accessible using everyday hardware like Macs, Raspberry Pis, and Jetsons. The series aims to explore whether heterogeneous clusters (mixing different hardware architectures) can effectively run local LLMs.
Persona Atlas: Mapping How Famous Minds Think
Hugging Face Blog52 days agoNew Tool
The title suggests Persona Atlas is a project focused on representing or exploring the thinking styles of famous figures. The source text is unavailable, so its format, methods, data, model use, and results cannot be verified. It may be relevant to persona modeling, AI role-play, conversational agents, or thought-style visualization, but the practical impact remains unclear without the full post.
LLM Research Papers: The 2026 List (January to May)
Ahead of AI (Raschka)52 days agoPaper
Sebastian Raschka compiles a curated reference list of LLM papers he bookmarked from January through May 2026. The list is not comprehensive, but organized around topics useful for future articles, lectures, code examples, and research work. Public sections emphasize reasoning, RL, efficient inference, long context, agent systems, tool use, coding agents, diffusion language models, and serving infrastructure.
Thousand Token Wood: shipping a multi-agent economy on a 3B model
Hugging Face Blog52 days agoTutorial
Based on the title, this Hugging Face Blog post presents Thousand Token Wood, a project shipping a multi-agent economy on a 3B model. The likely focus is practical system design under small-model constraints, rather than a new frontier-scale model release. Without the original text, details such as the exact model, architecture, benchmarks, code availability, and results cannot be confirmed.
Hermes Agent – Open-source AI agent with persistent memory
Hacker News (AI keywords)52 days agoNew Tool
Hermes Agent is an open-source autonomous agent by Nous Research, designed to run on your own server or machine with persistent local memory. It offers messaging gateways, scheduled automations, browser control, parallel sub-agents, reusable skills, and multiple LLM provider options. The project also targets MLOps and research workflows, including tool-calling trajectory generation, RL experiments, and exportable fine-tuning data.
Tiny hackable CUDA language model implementation
Hacker News (AI keywords)53 days agoNew Tool
This GitHub project implements a compact generative pretrained transformer as an autoregressive byte-level sequence model. Its README describes causal self-attention, RoPE, feed-forward layers, AdamW, cross-entropy training, and BLAS/OpenBLAS-backed matrix operations, with CUDA toolkit listed in setup steps. It is most useful as an educational and experimental codebase, not as a production-grade replacement for large commercial LLMs.
Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency★ 72
Hacker News (AI keywords)53 days agoRelease
Google released new Gemma 4 checkpoints optimized with Quantization-Aware Training to preserve quality after compression. The release includes Q4_0 checkpoints and a mobile-focused quantization format that can reduce Gemma 4 E2B memory use to about 1GB, or below 1GB for a text-only configuration. The models are available through Hugging Face and supported across llama.cpp, Ollama, LM Studio, LiteRT-LM, Transformers.js, SGLang, vLLM, MLX, and Unsloth.
Arithmetic Without Numbers: How LLMs Do Math
Hacker News (AI keywords)53 days agoCommentary
The article asks whether LLM arithmetic is memorization, heuristics, real computation, or experimental assistance. It summarizes Rune experiments that decode operations and operands from frozen Llama activations, then route them to Python under a no-parser rule. The strongest supported claim is narrow: activation-derived tool arguments worked in scoped audits, while residual-state JIT replacement, long-number generation, and cross-model transfer remain brittle.
Fine-tuning an LLM to write docs like it's 1995
Hacker News (AI keywords)53 days agoTutorial
The author builds a corpus from old Microsoft manuals, cleans OCR text, generates instruction-style JSONL examples, and fine-tunes Llama 3.1 8B and Qwen 2.5 7B with QLoRA. Tests cover malloc(), a fictional Win32 API, and a deliberately anachronistic REST API prompt. Qwen fine-tunes transfer the period documentation style best, but the experiment also shows hallucination risks, tuning complexity, and why these models augment rather than replace technical writers.
Magenta RealTime 2: An Open, Locally Runnable Real-Time Music Model★ 74
Hacker News (AI keywords)53 days agoRelease
Magenta RealTime 2 is an open-weights live music model designed for interactive performance rather than offline prompt-to-song generation. It supports real-time control through MIDI, audio, and text, and can run as standalone apps, DAW plugins, or embedded music software. Google Magenta also released a Python library, C++ MLX inference engine, models, and example applications for musicians and developers.
Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
Hugging Face Blog54 days agoTutorial
The post appears to focus on generating synthetic Q&A data from task seeds for Nemotron pretraining. Rather than a model launch, it likely emphasizes data generation and pretraining corpus design. Because the original article text is unavailable here, concrete claims about dataset scale, benchmarks, or implementation details should not be inferred.
Reve 2 and Ideogram 4: Layouts in Imagegen
Latent Space54 days agoRelease
Latent Space’s roundup frames image composition as a major barrier now being tackled by layout-aware image models. Reve 2.0 emphasizes precise generation and editing with layouts, while Ideogram 4.0 uses bounding boxes tied to region descriptions. The issue also covers MAI-Thinking-1, Gemma 4 12B, open audio models, agent execution layers, and model-routing cost debates.
Designing the hf CLI as an agent-optimized way to work with the Hub
Hugging Face Blog54 days agoCommentary
Based only on the title, this Hugging Face post appears to explain how the hf CLI is being designed for AI agents working with the Hub. It likely focuses on command-line ergonomics, automation, and predictable interactions with Hub resources. Without the full text, specific features, supported agents, or implementation details should not be inferred.
How LLMs Actually Work
Hacker News (AI keywords)54 days agoTutorial
The article explains how modern LLMs convert text into token IDs, embeddings, and position-aware vectors before passing them through stacked transformer blocks. It covers attention, multi-head attention, KV cache, GQA, feed-forward networks, MoE, residual streams, normalization, and decoding. Its goal is educational: helping readers understand the common architecture behind many current model families and read model cards or papers more confidently.
Google's Gemma 4 12B is designed to run on 16GB RAM laptops
Ars Technica AI55 days agoRelease
Google introduced Gemma 4 12B, an open model aimed at running locally on laptops with 16GB of RAM. The model uses a new encoding scheme and token prediction to improve efficiency relative to its size. Its practical importance depends on real-world benchmarks, but it could lower the barrier for private, offline, and local multimodal AI workflows.
Direct Preference Optimization Beyond Chatbots
Hugging Face Blog55 days agoTutorial
Based only on the title, this Hugging Face Blog post appears to discuss Direct Preference Optimization outside conventional chatbot use cases. It may frame DPO as a broader preference-alignment method for model outputs, workflows, or non-conversational AI systems. Without the full article, specific claims about experiments, datasets, models, or implementation details cannot be verified.
Farewell Ai2
Interconnects (Nathan L.)56 days agoCommentary
Nathan L. says this was his final week at the Allen Institute for AI (Ai2). He highlights the privilege of working on the Olmo models and describes the role as a period of growth and learning. The brief farewell post does not provide a reason for leaving, future plans, or details about any impact on Olmo development.
NVIDIA Cosmos 3: An Open Omni-model for Physical AI Reasoning and Action
Hugging Face Blog57 days agoRelease
Hugging Face Blog announces NVIDIA Cosmos 3, described as the first open omni-model for Physical AI reasoning and action. The title indicates a focus on AI systems that interact with physical-world scenarios rather than only text generation. Because the article body was not provided, its architecture, supported modalities, license, downloadable assets, benchmarks, and deployment requirements cannot be verified from the available material.
I Am Retiring from Tech to Live Offline
Simon Willison's Weblog58 days agoEthics
Simon Willison highlights Chad Whitacre’s decision to leave tech and Open Source, framed not as a forum threat but as concrete action. Whitacre describes wanting to become “AI Amish” or “Internet Amish,” moving toward an offline, analog life closer to 1980 than 1780. A previous post about using Claude Code with Opus 4.5 shows how agentic AI felt intoxicating and unsettling enough to push him away from technological accelerationism.
Show HN: Tiny-vLLM, a C++ and CUDA LLM Inference Engine
Hacker News (AI keywords)59 days agoNew Tool
Tiny-vLLM is a Show HN project described as a high-performance LLM inference engine implemented in C++ and CUDA. From the provided title alone, the project appears aimed at developers or ML engineers interested in GPU-accelerated local or server-side inference. No further claims about supported models, benchmarks, APIs, licensing, deployment targets, or production readiness are stated in the source.
CAPTCHAs can still detect AI agents★ 72
Hacker News (AI keywords)60 days agoPaper
Roundtable argues that CAPTCHA image recognition is largely solved, but process-level behavior still separates humans from AI agents. Their CogCAPTCHA30 benchmark combines CAPTCHA with cognitive psychology tasks to test not only outputs, but how answers are produced. Results suggest frontier models like Claude, GPT, and Gemini are not necessarily more humanlike than smaller or cognition-trained models.
ESMFold2: The Bitter Lesson Is Coming for Proteins★ 74
Latent Space62 days agoCommentary
Latent Space interviews Biohub’s Alex Rives about ESMFold2 and the broader ESM protein modeling stack. The discussion centers on datasets versus inductive bias, and whether protein biology is entering its own Bitter Lesson era. The key implication is that large-scale evolutionary sequence data and open models may become foundations for structure prediction, interaction modeling, and programmable biology.
Reachy Mini goes fully local
Hugging Face Blog62 days agoHardware
Hugging Face published a tutorial for running Reachy Mini conversations without cloud audio processing or API keys. The setup uses its speech-to-speech library as a cascaded VAD, STT, LLM, and TTS pipeline exposed through a Realtime API-compatible WebSocket. Recommended defaults include llama.cpp with Gemma 4, Silero VAD, Parakeet-TDT, and Qwen3-TTS, while allowing swaps to vLLM, MLX, Transformers, or hosted Responses API providers.
Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL
Hugging Face Blog62 days agoTutorial
Based on the title, this Hugging Face Blog post focuses on Delta Weight Sync in TRL. It likely discusses moving or synchronizing weight differences at very large model scale using a Hub bucket-related workflow. Without the full article, implementation details, benchmarks, APIs, and stability claims cannot be confirmed.

← PreviousPage 5Next →

Latest in AI

"Fully Hallucinated Operating System" Simulates an Entire OS via LLM Prompts

Exploring 2-bit QAT: Can Ultra-Compressed Large Models Outperform 4-bit Models Half Their Size?

Gemma-4-26B-A4B QAT Variant Performs Poorly in llama.cpp Compared to Non-QAT Version

Office-open-xml-viewer: Office XML document viewer rendering to HTML Canvas

Control 3D Avatars with Natural Language Using "Program as Weights" (programasweights)

llama.cpp Gemma4 MTP Support Merged

Clustering 3x Jetson Nano Orin Supers for Distributed AI

Persona Atlas: Mapping How Famous Minds Think

LLM Research Papers: The 2026 List (January to May)

Thousand Token Wood: shipping a multi-agent economy on a 3B model

Hermes Agent – Open-source AI agent with persistent memory

Tiny hackable CUDA language model implementation

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency★ 72

Arithmetic Without Numbers: How LLMs Do Math

Fine-tuning an LLM to write docs like it's 1995

Magenta RealTime 2: An Open, Locally Runnable Real-Time Music Model★ 74

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

Reve 2 and Ideogram 4: Layouts in Imagegen

Designing the hf CLI as an agent-optimized way to work with the Hub

How LLMs Actually Work

Google's Gemma 4 12B is designed to run on 16GB RAM laptops

Direct Preference Optimization Beyond Chatbots

Farewell Ai2

NVIDIA Cosmos 3: An Open Omni-model for Physical AI Reasoning and Action

I Am Retiring from Tech to Live Offline

Show HN: Tiny-vLLM, a C++ and CUDA LLM Inference Engine

CAPTCHAs can still detect AI agents★ 72

ESMFold2: The Bitter Lesson Is Coming for Proteins★ 74

Reachy Mini goes fully local

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL