Latest in AI

Google Quietly Releases a Faster Model in Mythos’ Shadow
量子位 QbitAI · 47 days ago · Release
Per the QbitAI headline, Google released a model with little fanfare while attention was on Mythos. The only concrete performance claim is a 4x speed increase; the model name, task scope, benchmark method, and availability are not given. The item is relevant to developers and AI practitioners tracking latency and throughput improvements.
Profiling in PyTorch Part 2: From nn.Linear to a Fused MLP
Hugging Face Blog · 47 days ago · Tutorial
This Hugging Face Blog post appears to be a technical tutorial in a PyTorch profiling series. From the title, it focuses on analyzing performance from basic nn.Linear operations to a fused multilayer perceptron implementation. The likely audience is ML engineers and developers interested in understanding where neural network execution time goes and how kernel fusion can improve model throughput.
llama.cpp Merges MTP Optimization Removing Padding and Extra D2D Copies
r/LocalLLaMA top day · 47 days ago · Release
llama.cpp merged PR #24086, which changes ggml_gated_delta_net so MTP passes snapshot count K as an operation parameter instead of deriving it from tensor shape. The change removes a padding workaround and copies emitted snapshots into the recurrent cache with a single strided ggml_cpy. Benchmarks on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf showed about a 4% throughput gain, with wall time falling from 21.71s to 20.91s.
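The "about 4%" figure can be checked from the two wall times quoted above. The following is a minimal sketch, assuming both runs processed the same fixed workload so that throughput scales as the inverse of wall time; the function names `Speedup` and `TimeReduction` are illustrative and do not come from llama.cpp.

```go
package main

import "fmt"

// Speedup returns the throughput ratio implied by two wall times,
// assuming the same amount of work was done in both runs.
func Speedup(before, after float64) float64 {
	return before / after
}

// TimeReduction returns the fractional drop in wall time.
func TimeReduction(before, after float64) float64 {
	return (before - after) / before
}

func main() {
	// Wall times in seconds, as quoted in the summary above.
	before, after := 21.71, 20.91

	fmt.Printf("throughput gain: %.1f%%\n", (Speedup(before, after)-1)*100)
	fmt.Printf("wall-time reduction: %.1f%%\n", TimeReduction(before, after)*100)
}
```

Under that assumption the program prints a throughput gain of 3.8% and a wall-time reduction of 3.7%, both of which round to the reported "about 4%".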
How much do amd64 microarchitecture levels help in Go?
Hacker News (AI keywords) · 50 days ago · Benchmark
Daniel Lemire tests Go’s GOAMD64 levels using Roaring Bitmaps on a modern Intel Xeon. v2 brings strong gains where popcnt matters, while v3 adds further speedups in dense bitmap and set-operation workloads through AVX2. v4, despite implying AVX-512 support, shows no meaningful improvement in these benchmarks, likely due to current Go compiler limitations.
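To illustrate why popcnt matters for bitmap workloads, here is a minimal sketch of the kind of loop involved. It is not code from the Roaring Bitmaps library or from the benchmark; `Cardinality` and `AndCardinality` are hypothetical helpers operating on a plain `[]uint64` dense bitmap. The standard-library call `bits.OnesCount64` is the population count that the Go compiler can lower to a hardware instruction on amd64.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Cardinality counts the set bits across the 64-bit words of a dense bitmap.
func Cardinality(words []uint64) int {
	n := 0
	for _, w := range words {
		n += bits.OnesCount64(w)
	}
	return n
}

// AndCardinality counts the bits set in both bitmaps without
// materializing the intersection. Words beyond the shorter
// bitmap are treated as zero.
func AndCardinality(a, b []uint64) int {
	if len(b) < len(a) {
		a, b = b, a
	}
	n := 0
	for i, w := range a {
		n += bits.OnesCount64(w & b[i])
	}
	return n
}

func main() {
	a := []uint64{0xFF, 0x1}
	b := []uint64{0x0F, 0x3}
	fmt.Println(Cardinality(a), AndCardinality(a, b)) // prints "9 5"
}
```

The microarchitecture level is chosen at build time through the `GOAMD64` environment variable, for example `GOAMD64=v2 go build`. At the default v1 level the compiler cannot assume POPCNT is present, which is a plausible reason such loops gain from v2; the measured effect depends on the workload and Go version.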
Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
Hugging Face Blog · 60 days ago · Tutorial
Based on the title, this Hugging Face Blog post is an introductory PyTorch profiling guide focused on torch.profiler. It likely targets developers and ML engineers who need to identify training or inference bottlenecks through observable performance data. Since the full article text was not provided, implementation details, examples, and specific optimization advice cannot be confirmed.
Hugging Face Launches New Dataset Streaming Technology: 100x Efficiency Gain ★ 85
Hugging Face Blog · 274 days ago · Release
Hugging Face's official blog recently published a major update announcing a comprehensive overhaul of the streaming mode in its core open-source library…
Efficient Request Queueing: A Key Strategy for Optimizing LLM Inference Performance ★ 75
Hugging Face Blog · 481 days ago · Tutorial
The Unique Challenges and Memory Bottlenecks of LLM Inference: Traditional web services primarily handle concurrent requests through multi-threading or…
From Chunks to Blocks: How the Hugging Face Hub Greatly Speeds Up Model and Dataset Uploads and Downloads ★ 75
Hugging Face Blog · 531 days ago · Release
Background and Pain Points: As large language models (LLMs) have become widespread, the file sizes hosted on the Hugging Face Hub have grown dramatically…
Rearchitecting Hugging Face's Upload and Download Infrastructure ★ 75
Hugging Face Blog · 609 days ago · Release
Hugging Face, the world's largest open-source AI platform, currently hosts over 1.2 million models, datasets, and Space applications. With the explosion of…