Claude Opus 4.6 Built a C Compiler From Scratch — Here's How

Anthropic researcher Nicholas Carlini used 16 Claude Opus 4.6 agents to build a 100,000-line C compiler in Rust that compiles the Linux 6.9 kernel, passes 99% of GCC torture tests, and runs Doom. Total cost: $20,000 in two weeks.

Admin · February 18, 2026 · 9 min read

In February 2026, Anthropic researcher Nicholas Carlini gave 16 instances of Claude Opus 4.6 a task: build a complete C compiler from scratch, in Rust, with minimal human intervention. Two weeks and $20,000 in API costs later, the agents delivered a 100,000-line Rust codebase that compiles the Linux 6.9 kernel across x86, ARM, and RISC-V architectures, passes 99% of the GCC torture test suite, and runs Doom.

This is not a demo or a toy project. The compiler handles production open-source software: QEMU, FFmpeg, SQLite, PostgreSQL, and Redis all compile and run. It represents the largest autonomous software engineering project ever completed by AI agents — and it reveals both the extraordinary capabilities and the concrete limitations of current AI coding tools.

What Claude Opus 4.6 Brings to the Table

Claude Opus 4.6, released February 5, 2026, is Anthropic's most capable model for extended reasoning and complex multi-step workflows. Before diving into the compiler project, it is worth reviewing the model's specifications, which explain why it was up to the task.

| Specification | Detail |
| --- | --- |
| Context window | 1 million tokens (beta) |
| Max output | 128K tokens per response |
| Input pricing | $5 per million tokens |
| Output pricing | $25 per million tokens |
| Key features | Agent teams, adaptive thinking, effort parameter |
| Availability | Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry |

Opus 4.6 introduced "agent teams" for improved AI coding collaboration and autonomous task execution — exactly the capability Carlini exploited. The model also features "adaptive thinking" and an "effort parameter" to control reasoning depth, allowing agents to allocate more compute to harder problems.

On benchmarks, Opus 4.6 scored 80.8% on SWE-bench Verified (the highest of any model at the time), outperformed OpenAI's GPT-5.2 on Terminal-Bench 2.0 for agentic coding, and beat GPT-5.2 on Humanity's Last Exam, a multi-disciplinary reasoning test.

How 16 AI Agents Built a Compiler

Carlini, a member of Anthropic's Safeguards team, deployed 16 parallel Claude Opus 4.6 instances working on a shared repository. Each agent functioned as a specialized software engineer, responsible for different phases of compiler construction. The agents communicated through structured prompts and shared code, mimicking a distributed engineering team.

The project consumed nearly 2,000 Claude Code sessions over two weeks. Each session focused on iterative refinement: writing one component, testing it, fixing failures, and integrating with work from other agents. The $20,000 API bill reflects the volume of tokens processed — each session involved deep reasoning across thousands of lines of context.

What makes this remarkable is how little human intervention was needed. Carlini set up the framework, defined the goals, and let the agents execute. This is qualitatively different from a developer writing code with AI assistance; here, the AI agents were doing the engineering autonomously.

The Compiler Pipeline

Building a C compiler requires multiple interconnected systems. Here is how the agents divided the work:

Lexer and Parser (Frontend). Initial agents built a recursive descent parser in Rust targeting the C99 standard. They handled preprocessor directives, type inference, and the hundreds of edge cases that make C parsing notoriously difficult. Public compiler specifications served as reference material.
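A recursive descent parser pairs one function with each grammar rule. The toy sketch below illustrates the technique in Rust, the project's language, on integer expressions with `+`, `*`, and parentheses; every type and name here is invented for illustration and is not taken from the actual codebase.

```rust
// AST for a tiny expression grammar (illustrative, not the real compiler's).
#[derive(Debug, PartialEq)]
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

struct Parser<'a> {
    src: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn new(src: &'a str) -> Self {
        Parser { src: src.as_bytes(), pos: 0 }
    }

    fn peek(&self) -> Option<u8> {
        self.src.get(self.pos).copied()
    }

    fn bump(&mut self) -> Option<u8> {
        let c = self.peek();
        self.pos += 1;
        c
    }

    // expr := term ('+' term)*
    fn expr(&mut self) -> Expr {
        let mut lhs = self.term();
        while self.peek() == Some(b'+') {
            self.bump();
            lhs = Expr::Add(Box::new(lhs), Box::new(self.term()));
        }
        lhs
    }

    // term := atom ('*' atom)*  -- '*' binds tighter than '+'
    fn term(&mut self) -> Expr {
        let mut lhs = self.atom();
        while self.peek() == Some(b'*') {
            self.bump();
            lhs = Expr::Mul(Box::new(lhs), Box::new(self.atom()));
        }
        lhs
    }

    // atom := NUMBER | '(' expr ')'
    fn atom(&mut self) -> Expr {
        if self.peek() == Some(b'(') {
            self.bump();
            let e = self.expr();
            self.bump(); // consume ')'
            return e;
        }
        let mut n = 0i64;
        while let Some(c @ b'0'..=b'9') = self.peek() {
            n = n * 10 + i64::from(c - b'0');
            self.bump();
        }
        Expr::Num(n)
    }
}

fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Add(a, b) => eval(a) + eval(b),
        Expr::Mul(a, b) => eval(a) * eval(b),
    }
}

fn main() {
    let ast = Parser::new("2+3*(4+1)").expr();
    println!("{}", eval(&ast)); // precedence gives 2 + 3*5 = 17
}
```

A production C frontend layers a separate lexer, a preprocessor, and error recovery on top of this pattern, but each grammar rule still maps to a function in the same way.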

Intermediate Representation and Optimization. Middle agents implemented IR transformations and peephole optimizations: the phase where raw parsed code is transformed into more efficient representations. The agents used Rust crates such as nom for parsing, though they notably did not integrate LLVM.
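A peephole pass walks the instruction stream and rewrites small local patterns. The sketch below uses an invented three-address IR (not the project's actual instruction set) and folds the algebraic identities `x + 0` and `x * 1` into plain register copies:

```rust
// Toy three-address IR; registers are just numeric IDs.
#[derive(Debug, Clone, PartialEq)]
enum Inst {
    AddImm { dst: u32, src: u32, imm: i64 }, // dst = src + imm
    MulImm { dst: u32, src: u32, imm: i64 }, // dst = src * imm
    Mov { dst: u32, src: u32 },              // dst = src
}

/// One peephole pass: rewrite algebraic identities instruction by
/// instruction, leaving everything else untouched.
fn peephole(code: &[Inst]) -> Vec<Inst> {
    code.iter()
        .map(|inst| match *inst {
            Inst::AddImm { dst, src, imm: 0 } => Inst::Mov { dst, src },
            Inst::MulImm { dst, src, imm: 1 } => Inst::Mov { dst, src },
            ref other => other.clone(),
        })
        .collect()
}

fn main() {
    let before = vec![
        Inst::AddImm { dst: 1, src: 0, imm: 0 }, // r1 = r0 + 0  -> copy
        Inst::MulImm { dst: 2, src: 1, imm: 1 }, // r2 = r1 * 1  -> copy
        Inst::AddImm { dst: 3, src: 2, imm: 5 }, // r3 = r2 + 5  -> unchanged
    ];
    for inst in peephole(&before) {
        println!("{:?}", inst);
    }
}
```

Real optimizers chain many such passes with multi-instruction windows; the value of the pattern is that each rewrite is small enough to reason about in isolation.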

Backend Code Generation. Backend agents targeted three architectures: x86, ARM, and RISC-V. They generated assembly that GCC's assembler and linker could process. This is a key design decision — rather than building a complete standalone toolchain, the compiler relies on GCC's low-level tools for the final linking step.
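The "emit text, let GCC assemble" design can be illustrated with a toy lowering pass that prints x86-64 assembly in AT&T syntax. Everything here (the `Node` type, the stack-machine strategy, the function name `answer`) is invented for illustration; real code generation also handles register allocation, calling conventions, and much more.

```rust
// Toy expression IR to lower (illustrative only).
enum Node {
    Num(i64),
    Add(Box<Node>, Box<Node>),
}

/// Emit stack-machine style assembly: each node leaves its value in %rax.
fn emit(e: &Node, out: &mut String) {
    match e {
        Node::Num(n) => out.push_str(&format!("    movq ${}, %rax\n", n)),
        Node::Add(a, b) => {
            emit(a, out);
            out.push_str("    pushq %rax\n"); // spill left operand
            emit(b, out);
            out.push_str("    popq %rcx\n");  // reload left operand
            out.push_str("    addq %rcx, %rax\n");
        }
    }
}

/// Wrap a lowered body in a globally visible function symbol.
fn codegen(name: &str, body: &Node) -> String {
    let mut s = format!(".globl {name}\n{name}:\n");
    emit(body, &mut s);
    s.push_str("    ret\n");
    s
}

fn main() {
    let e = Node::Add(Box::new(Node::Num(40)), Box::new(Node::Num(2)));
    // Print the .s text; the real pipeline hands output like this to
    // GCC's assembler and linker for the final steps.
    print!("{}", codegen("answer", &e));
}
```

Feeding the printed `.s` text to `gcc -c` would yield a linkable object file, which is exactly the division of labor described above.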

Testing and Integration. Final agents ran the GCC torture test suite — over 20,000 edge-case tests designed to break compilers — achieving a 99% pass rate. They also ran custom benchmarks against real-world projects.
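A standard way to validate a new compiler against a trusted one is differential testing: run identical inputs through both and count disagreements. The harness below is an assumption about the shape of such a runner, not a description of the agents' actual one, and both sides are stand-in closures rather than real compiler invocations:

```rust
/// Run every test case through a candidate and a reference
/// implementation; return (cases that agree, total cases).
fn differential_test<F, G>(cases: &[i64], candidate: F, reference: G) -> (usize, usize)
where
    F: Fn(i64) -> i64,
    G: Fn(i64) -> i64,
{
    let mut passed = 0;
    for &input in cases {
        if candidate(input) == reference(input) {
            passed += 1;
        }
    }
    (passed, cases.len())
}

fn main() {
    let cases: Vec<i64> = (-200..=200).collect();
    // Candidate mishandles one edge case, mimicking a rare miscompilation.
    let candidate = |x: i64| if x == -128 { 0 } else { x.wrapping_abs() };
    let reference = |x: i64| x.wrapping_abs();
    let (passed, total) = differential_test(&cases, candidate, reference);
    println!("{passed}/{total} cases agree"); // prints "400/401 cases agree"
}
```

A real runner would compile each torture-suite file with both toolchains, execute the binaries, and diff their outputs; the pass-rate arithmetic is the same.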

What the Compiler Can Actually Build

The test results speak for themselves:

| Project | Status | Why it matters |
| --- | --- | --- |
| Linux 6.9 kernel | ✅ Compiles on x86, ARM, RISC-V | Kernel builds test the deepest corners of C compliance |
| QEMU | ✅ Compiles and runs | Full system emulator, tests virtualization code paths |
| FFmpeg | ✅ Compiles and runs | Multimedia processing, tests SIMD and optimization paths |
| SQLite | ✅ Compiles and runs | Database engine, tests transaction logic |
| PostgreSQL | ✅ Compiles and runs | Complex query optimizer, tests deep C features |
| Redis | ✅ Compiles and runs | In-memory store, tests networking stacks |
| Doom | ✅ Compiles and runs | Classic game engine, tests graphics and game loop code |
| GCC torture suite | 99% pass rate | 20,000+ adversarial edge-case tests |

Compiling the Linux kernel is particularly impressive. Kernel code uses every dark corner of C: inline assembly, GCC extensions, architecture-specific intrinsics, complex macro systems, and build system quirks that have broken even established compilers.

Honest Assessment: Where It Falls Short

The compiler has real limitations that matter:

No standalone toolchain. It lacks its own assembler and linker, relying on GCC for those final steps. This means it cannot produce executables without GCC installed — fundamentally limiting its use as a drop-in replacement.

Code quality lags behind GCC. Generated binaries are less efficient than GCC's output, even when GCC's own optimizations are disabled. This stems from the AI's pattern-matching approach to optimization, which approximates transforms rather than applying provably optimal algorithms.

The 1% failure rate matters. In production compilers, even a 0.01% miscompilation rate is unacceptable. A 1% failure rate on the torture test suite means roughly 200 edge cases where the compiler produces incorrect code. For safety-critical software, this alone disqualifies it.

AI hallucinations during development. Carlini reported that agents occasionally proposed invalid intermediate representations or overlooked undefined behavior. Human oversight was still necessary to catch these errors, raising questions about fully autonomous compiler development.

The $20,000 Question: Is This Cost-Effective?

The project cost approximately $20,000 in Anthropic API usage over two weeks. For context:

  • Hiring a team of senior compiler engineers to build an equivalent tool from scratch would take years and cost millions.
  • An individual senior compiler engineer earns $200,000–$400,000 annually but still could not replicate this scope in two weeks.
  • The output is 100,000 lines of functional Rust code — roughly $0.20 per line.

For prototyping and exploration, $20,000 is remarkably cheap. For production use, the limitations mean additional investment would be needed to reach GCC parity.

How This Compares to Other AI Coding Projects

No other AI system has publicly demonstrated compiler-scale autonomous engineering.

Devin (Cognition Labs) promises end-to-end software engineering but has not shown outputs at this scale. OpenAI's GPT-5.3 Codex excels at terminal and coding benchmarks but targets interactive pair-programming, not autonomous multi-agent construction. EleutherAI's open models compile code snippets, not kernels.

Claude's multi-agent approach — 16 coordinated instances working on a shared repo — differs fundamentally from single-model code generation. It enables division of labor: some agents focus on parsing while others handle optimization, the same way human engineering teams operate. This parallelism is currently unique to Claude's agent teams feature.

What This Means for Software Engineering

For developers: Study the Rust codebase (when available through Anthropic's published reports) to understand how LLMs structure compiler components. The patterns for building lexers, parsers, and code generators with AI assistance are directly applicable to custom tooling projects.

For businesses: AI agent teams can now produce production-quality prototypes in weeks rather than months. Expect this approach to spread to firmware development, driver development, and domain-specific compilers. The $20,000 cost point makes experimental compiler creation accessible to any engineering team.

For the industry: The shift from "AI assists coding" to "AI teams build entire systems" is a qualitative change. The limiting factor is no longer whether AI can write code — it is whether AI-written code can be trusted in production without human audit.

Frequently Asked Questions

How many lines of code did Claude Opus 4.6 produce?

The AI agents produced approximately 100,000 lines of Rust code across the complete compiler toolchain, spanning lexer, parser, optimizer, and code generators for three CPU architectures.

Can the compiler replace GCC?

Not yet. It lacks its own assembler and linker, generates slower code than GCC, and has a 1% failure rate on the torture test suite. It complements rather than replaces GCC.

How much did the project cost?

Approximately $20,000 in Anthropic API usage over two weeks and nearly 2,000 Claude Code sessions with 16 parallel Opus 4.6 instances.

Did a human write any of the compiler code?

Nicholas Carlini set up the project framework and goals but the 16 AI agents performed the actual engineering with minimal human intervention. Carlini provided oversight to catch hallucinations.

What architectures does it support?

The compiler generates code for x86, ARM, and RISC-V architectures, covering the three most important instruction set families in computing.

Is the compiler code publicly available?

Details have been shared through Anthropic, InfoQ, The Register, and Analytics Vidhya. Check original reports for links to the Rust codebase.

This project is a proof of concept that will look modest in retrospect. If Opus 4.7 can close the performance gap with GCC and eliminate the dependency on GCC's assembler, the implications for software engineering are enormous. Watch for open-sourcing of the compiler, expansion of agent orchestration tools, and similar experiments targeting C++, Java, and Go compilers.
