Latest in AI

LLM Serving Fairness: How Cohere Eliminates the Noisy Neighbour Problem
Cohere Blog · 40 days ago · Commentary
Cohere's engineering blog addresses the "noisy neighbour" problem in multi-tenant LLM serving, where one tenant's heavy workload degrades performance for others sharing the same infrastructure. The post outlines how Cohere designs its serving layer to guarantee each tenant receives a fair and consistent share of compute resources. This is a practical look at production-grade fairness mechanisms relevant to any organisation relying on shared AI API infrastructure.
Show HN: Tiny-vLLM, a C++ and CUDA LLM Inference Engine
Hacker News (AI keywords) · 59 days ago · New Tool
Tiny-vLLM is a Show HN project described as a high-performance LLM inference engine implemented in C++ and CUDA. From the provided title alone, the project appears aimed at developers or ML engineers interested in GPU-accelerated local or server-side inference. No further claims about supported models, benchmarks, APIs, licensing, deployment targets, or production readiness are stated in the source.
Real-Time LLM Inference on Standard GPUs at 3k Tokens/s per Request
Hacker News (AI keywords) · 60 days ago · Benchmark
The post’s title indicates a performance claim for real-time LLM inference on standard GPUs, reporting 3,000 tokens per second per request. No article body is available, so the underlying model, GPU type, batch size, latency profile, precision, serving stack, and benchmark method are not stated. The item is best treated as an inference-performance benchmark claim rather than a verified deployment guide.
Mastering Long-Context Processing in Large Language Models (LLMs) with KVPress ★ 75
Hugging Face Blog · 551 days ago · New Tool
In the current trajectory of large language model (LLM) development, support for long contexts has become a standard requirement. However, as input text length…
Hugging Face and FriendliAI Partner to Accelerate Model Deployment on the Hub ★ 70
Hugging Face Blog · 552 days ago · Release
Hugging Face has announced a strategic partnership with FriendliAI, a company specializing in high-performance AI inference, aimed at comprehensively improving…
Accelerating Hugging Face Assisted Generation with Dynamic Speculation ★ 75
Hugging Face Blog · 658 days ago · Release
Hugging Face has published a technical blog post on "Dynamic Speculation," aimed at optimizing the inference speed of large language models (LLMs)…
Hugging Face Deeply Integrates the AMD Instinct MI300X GPU: Out-of-the-Box Open-Source AI Performance Optimization ★ 75
Hugging Face Blog · 798 days ago · Release
With the explosive growth of generative AI, demand for high-performance GPUs has reached an unprecedented level. To break hardware monopolies and reduce AI…
Unlocking Longer Text Generation: A Deep Dive into Key-Value (KV) Cache Quantization ★ 80
Hugging Face Blog · 803 days ago · Tutorial
During the inference process of large language models (LLMs), the self-attention mechanism needs to store the Key and Value vectors of historical tokens (i.e…
Running a Text-Generation Pipeline on Intel® Gaudi® 2 AI Accelerators
Hugging Face Blog · 880 days ago · Release
With the explosive growth of large language models (LLMs), the demand for high-performance, cost-effective AI hardware has increased significantly. Intel Gaudi…
Hugging Face TGI (Text Generation Inference) Officially Supports AWS Inferentia2 Chips ★ 75
Hugging Face Blog · 908 days ago · Release
Hugging Face has partnered with AWS to officially bring its widely popular open-source LLM inference optimization framework, Text Generation Inference (TGI)…
AMD Teams Up with Hugging Face: Introducing optimum-amd for Out-of-the-Box LLM Acceleration on AMD GPUs ★ 75
Hugging Face Blog · 966 days ago · Release
Hugging Face's official blog announced a deep partnership with chip giant AMD, launching `optimum-amd`, an open-source library optimized specifically for AMD…
Optimizing Your Large Language Model (LLM) in Production: A Hugging Face Practical Guide ★ 85
Hugging Face Blog · 1,047 days ago · Tutorial
This technical guide from Hugging Face systematically introduces the core strategies for deploying and optimizing large language models (LLMs) in production…
Making Large Language Models Lighter with AutoGPTQ and transformers ★ 85
Hugging Face Blog · 1,070 days ago · Release
This Hugging Face official blog post introduces a major update that integrates AutoGPTQ into the `transformers` and `optimum` libraries. GPTQ (Generalized…
Towards Encrypted Large Language Models: Privacy-Preserving Inference with Fully Homomorphic Encryption (FHE) ★ 75
Hugging Face Blog · 1,091 days ago · Tutorial
This blog post, co-authored by Hugging Face and Zama — a cryptography company specializing in Fully Homomorphic Encryption (FHE) — explores how to address a…
Optimization Story: BLOOM Inference Optimization in Practice
Hugging Face Blog · 1,385 days ago · Tutorial
This technical blog post from Hugging Face documents in detail the practical process of optimizing inference for BLOOM, the open-source multilingual large…