AI Coding Benchmark: Systems Languages

A rigorous benchmark for evaluating AI coding assistants on Go and Rust — where the compiler is the judge.

Why this benchmark?

Most AI coding evaluations are conducted in Python or JavaScript, where “close enough” often runs. In systems programming, the compiler is unforgiving. We test whether AI models can produce code that doesn’t just look right, but actually builds and adheres to strict architectural patterns.

The Challenge

We evaluate models across four rigor levels in Go and Rust, ranging from basic Clean Architecture patterns to “Nightmare” scenarios involving complex dependency injection graphs (Uber Fx) and compile-time macro expansions (Diesel).

The goal is simple: The code must compile, run, and pass strict unit tests. No hallucinations allowed.

Leaderboard

Rank	Model	Go Medium	Go Nightmare	Rust Medium	Rust Nightmare	Average
🥇	Claude 3.5 Sonnet 20241022	90✓	75✓	85✓	30✗	70
🥈	Gemini 1.5 Pro 002	88✓	65✓	75✓	25✗	63.25
🥉	GPT-4o 2024-05-13	85✓	60✗	70✓	20✗	58.75
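
The leaderboard's average column is consistent with an unweighted mean of the four challenge scores. The following is a minimal sketch of that arithmetic, assuming equal weighting; the `Scores` type and its field names are hypothetical and not part of the benchmark's own tooling.

```go
package main

import "fmt"

// Scores holds the four per-challenge scores of one leaderboard row.
// The type and field names are illustrative assumptions.
type Scores struct {
	GoMedium, GoNightmare, RustMedium, RustNightmare float64
}

// Average returns the unweighted mean of the four challenge scores.
func (s Scores) Average() float64 {
	return (s.GoMedium + s.GoNightmare + s.RustMedium + s.RustNightmare) / 4
}

func main() {
	// Scores copied from the leaderboard table.
	rows := []struct {
		name   string
		scores Scores
	}{
		{"Claude 3.5 Sonnet 20241022", Scores{90, 75, 85, 30}},
		{"Gemini 1.5 Pro 002", Scores{88, 65, 75, 25}},
		{"GPT-4o 2024-05-13", Scores{85, 60, 70, 20}},
	}
	for _, r := range rows {
		fmt.Printf("%-28s %.2f\n", r.name, r.scores.Average())
	}
}
```

Running it reproduces the published averages of 70, 63.25, and 58.75.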

Last updated: February 12, 2025

Challenge Details

🐹 Go Challenges

Medium: Clean Architecture. Tests basic Go web service patterns using Gin and GORM. The focus is on manual dependency injection and proper separation of concerns without framework magic.
Nightmare: Fx Dependency Graph. A test of architectural coherence. The AI must wire a microservice using Uber Fx, handling cryptic lifecycle errors and complex dependency graphs that often lead to “hallucinated” annotations.

🦀 Rust Challenges

Medium: Async All The Way Down. Using Axum and SQLx, this challenge tests async handling, connection pooling, and strict type mappings. It requires proper implementation of async traits and error handling.
Nightmare: Diesel Macro Magic. The ultimate test of precision. The AI must manually write diesel::table! macros and implement complex traits for custom types (such as enums) to satisfy the Diesel ORM’s compile-time checks.