AI Coding Benchmark: Systems Languages
A rigorous benchmark for evaluating AI coding assistants on Go and Rust — where the compiler is the judge.
Why this benchmark?
Most AI coding evaluations are conducted in Python or JavaScript, where “close enough” often runs. In systems programming, the compiler is unforgiving. We test whether AI models can produce code that doesn’t just look right, but actually builds and adheres to strict architectural patterns.
The Challenge
We evaluate models across four rigor levels in Go and Rust, ranging from basic Clean Architecture patterns to “Nightmare” scenarios involving complex dependency injection graphs (Uber Fx) and compile-time macro expansions (Diesel).
The goal is simple: The code must compile, run, and pass strict unit tests. No hallucinations allowed.
Leaderboard
| # | Model | Go Medium | Go Nightmare | Rust Medium | Rust Nightmare | Average |
|---|---|---|---|---|---|---|
| 🥇 | Claude 3.5 Sonnet 20241022 | 90 | 75 | 85 | 30 | 70 |
| 🥈 | Gemini 1.5 Pro 002 | 88 | 65 | 75 | 25 | 63.25 |
| 🥉 | GPT-4o 2024-05-13 | 85 | 60 | 70 | 20 | 58.75 |
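The Avg values in the table are consistent with an unweighted mean of the four challenge scores. The sketch below reproduces them under that assumption; the `mean` helper and the row layout are illustrative and not part of the benchmark's tooling.

```go
package main

import "fmt"

// mean returns the unweighted arithmetic mean of scores.
// It returns 0 for an empty slice to avoid dividing by zero.
func mean(scores []float64) float64 {
	if len(scores) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range scores {
		sum += s
	}
	return sum / float64(len(scores))
}

func main() {
	// Scores copied from the leaderboard, in column order:
	// Go Medium, Go Nightmare, Rust Medium, Rust Nightmare.
	rows := []struct {
		model  string
		scores []float64
	}{
		{"Claude 3.5 Sonnet 20241022", []float64{90, 75, 85, 30}},
		{"Gemini 1.5 Pro 002", []float64{88, 65, 75, 25}},
		{"GPT-4o 2024-05-13", []float64{85, 60, 70, 20}},
	}
	for _, r := range rows {
		fmt.Printf("%-28s %.2f\n", r.model, mean(r.scores))
	}
}
```

Because every challenge carries equal weight under this assumption, a low Rust Nightmare score pulls the average down as much as a low Medium score would.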
Challenge Details
🐹 Go Challenges
- Medium: Clean Architecture. Tests basic Go web service patterns using Gin and GORM. The focus is on manual dependency injection and proper separation of concerns without framework magic.
- Nightmare: Fx Dependency Graph. A test of architectural coherence. The AI must wire a microservice using Uber Fx, handling cryptic lifecycle errors and complex dependency graphs that often lead to “hallucinated” annotations.
🦀 Rust Challenges
- Medium: Async All The Way Down. Using Axum and SQLx, this challenge tests async handling, connection pooling, and strict type mappings. It requires proper implementation of async traits and error handling.
- Nightmare: Diesel Macro Magic. The ultimate test of precision. The AI must manually write `diesel::table!` macros and implement complex traits for custom types (like Enums) to satisfy the Diesel ORM’s compile-time checks.
Note: Results are updated periodically as new models are released and re-evaluated.