Latest in AI

Showing:evaluationProductClear ×

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

FrontierCode: Benchmarking for Code Quality over Slop
Latent Space49 days agoBenchmark
Latent Space briefly announced FrontierCode with the line “We made a thing!” From the title, FrontierCode appears to be a benchmark for frontier coding systems that prioritizes code quality rather than sheer code generation volume. The provided excerpt does not include methodology, model results, datasets, or tooling details, so conclusions should remain cautious.
Introducing Search Toolkit★ 72
Mistral AI News50 days agoNew Tool
Mistral AI introduced Search Toolkit in public preview as a composable framework for AI search infrastructure. It unifies ingestion, retrieval, and evaluation with support for parsing, chunking, embeddings, BM25, dense retrieval, hybrid search, and standard retrieval metrics. The toolkit targets enterprise search, RAG quality improvement, and domain-specific retrieval, with a starter app using Docker, uv, and Vespa.
ITBench-AA: Frontier Models Score Below 50% on Enterprise IT Tasks★ 72
Hugging Face Blog61 days agoBenchmark
Artificial Analysis and IBM present ITBench-AA, described in the title as the first benchmark for agentic enterprise IT tasks. The headline result is that frontier models score below 50%, suggesting current systems still struggle with enterprise-grade agent workflows. The original article text is unavailable here, so task design, evaluated models, scoring methodology, and rankings cannot be confirmed.